7 Branch&Pennypacker Generalization and Generality
-
Upload
marjorie-wanderley -
Category
Documents
-
view
15 -
download
0
Transcript of 7 Branch&Pennypacker Generalization and Generality
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
1/25
151
DOI: 10.1037/13937-007APA Handbook of Behavior Analysis: Vol. 1. Methods and Principles,G. J. Madden (Editor-in-Chief)Copyright 2013 by the American Psychological Association. All rights reserved.
C H A P T E R 7
GENERALITY AND GENERALIZATION
OF RESEARCH FINDINGS
Marc N. Branch and Henry S. Pennypacker
For generalization, psychologists must
finally rely, as has been done in all the
older sciences, on replication. (Cohen,
1994, p. 997)
Confirmation comes from repetition. . . .
Repetition is the basis for judging . . . sig-
nificance and confidence. (Tukey, 1969,
pp. 8485)
As the general psychology research community
becomes increasingly aware (e.g., Cohen, 1994; Lof-
tus, 1991, 1996; Wilkinson & Task Force on Statisti-
cal Inference, 1999) of the limitations of traditional
group designs and statistical inference methods with
regard to assessing reliability and generality of
research findings, we present an alternative approach
that has been substantially developed in the branch of
psychology now known as behavior analysis. In this
chapter, we outline how individual subject methods,
that is, so-called single-case designs, provide straight-
forward and, in principle, simple methods to assess
the reliability and generality of research findings.
OVERVIEW
The chapter consists of three major sections. Inthe first, we summarize the limitations of tradi-
tional methods, especially as they relate to assess-
ing reliability and generality of research findings
concerning behavior. We make the case that tradi-
tional methods have obscured an important dis-
tinction that has led to psychologys consisting of
two related, but separable, subject matters, behav-
ioral science and actuarial science. We also focus
on the issue of generality across individuals and
how traditional methods can give the illusion of
such generality. In the second major section, we
discuss dimensions of generality in addition to
generality across individuals. Here we define scien-
tific generalityand several other forms of generality
as well. In so doing, we introduce the roles of rep-
lication, both direct and systematic, in assessing
generality of research results. We argue that repli-
cation, instead of statistical inference, is an alter-
native primary method for determining not only
the reliability of results but also for assessing and
characterizing the generality of scientific findings.
In the third major section, we discuss generaliza-
tion of treatment effects, the fundamentals of tech-
nology transfer, and the practices that characterize
translational research. There, we write of program-
ming for and assessment of generalizability of sci-
entific findings to applied settings. We expand our
view then to the engineering issues of technology
development (or technology transfer and transla-
tional research) as a capstone demonstration of
generalization based on an understanding of gen-
erality of research findings.
LIMITATIONS OF TRADITIONAL
METHODS
The traditional group-mean, statistical-inference
approach to analyzing research results has faced
Preparation of this chapter was supported by National Institute on Drug Abuse Grant DA004074.
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
2/25
Branch and Pennypacker
152
consistent criticism for more than 4 decades (e.g.,
Bakan, 1966; Carver, 1978; Cohen, 1994; Gigeren-
zer, Krauss, & Vitouch, 2004; Loftus, 1991, 1996;
Meehl, 1967, 1978; Nickerson, 2000; Rozeboom,
1960). Most of that criticism has focused on what
those methods have to say about the reliability of
research findings, which is appropriate because iffindings are not reliable, there is no need to assess
their generality. These methods, however, have also
been criticized with respect to theory testing and
development, issues that directly relate to generality.
We treat these two categories of criticism separately.
Significance Testing and ReliabilityAfter all of the carefully reasoned criticism of signifi-
cance testing that has been published, one would
hope that a clear understanding of its limits would
exist among professional psychologists. That, how-ever, appears not to be true, as noted by Cohen
(1994), who lamented that
after 4 decades of severe criticism, the
ritual of null hypothesis significance
testing . . . still persists. [As does] near
universal misinterpretation ofpas the
probability that H-sub-0 is false, [and]
the misinterpretation that its comple-
ment is the probability of successful rep-
lication. (p. 997)Cohens assertion is supported by survey evidence
revealing that a substantial majority of academic
research psychologists incorrectly interpretpvalues
and statistical significance (Haller & Krauss, 2002;
Kalinowski, Fidler, & Cumming, 2008; Oakes,
1986). That a significant proportion of professional
psychologists do not appreciate what statistical
significance and, especially,pvalues represent is
apparent testimony to a weakness in the training of
research psychologists, a failing that lies at the feet
of those of us who are engaged in teaching them. Infact, Haller and Krauss (2002) included a sample
of statistical methodology instructors in their study
and found that 80% of them were mistaken in their
understanding ofpvalues, so it comes as less of a
surprise that the misconceptions are widespread. The
following discussion, therefore, is another attempt to
make clear what apvalue is and what it means.
Apvalue, which results from a significance test,
is a conditional probability. Specifically, it is the
probability, if the null hypothesis is true, of obtain-
ing data of a particular sort. That is, in algebraic
symbols, it isp=P(Data|H0). The important point
is thatpP(H0|Data), which is what a researcher
would presumably really like to know. In otherwords, apvalue does not provide quantitative infor-
mation about whether the null hypothesis is true,
which is apparently widely misunderstood. Because
it does not provide the oft-assumed information
about the likelihood of the null hypothesis being
true, apvalue of .01 does not mean that the proba-
bility of the null hypothesis being true is 1 in 100.
In fact, it conveys nothing quantitative about the
truth of the null hypothesis. To see why, note that
changing the order of conditionality in a condi-
tional probability is crucially important. Considersuch examples as P(Dead|Electrocuted) versus
P(Electrocuted|Dead) or P(Cloudy|Raining) versus
P(Raining|Cloudy). The first probability in each pair
tells nothing about the second, just as P(Data|H0)
reveals nothing about P(H0|Data). Apvalue, there-
fore, has quantitative meaning only if the null
hypothesis is true, but when performing statistical
tests not only does one not know whether the null
hypothesis is true, one probably assumes it is not.
The important fact is that a finding of statistical sig-
nificance, via a smallpvalue, does not imply thatthe null hypothesis is unlikely to be true. The incor-
rect logic underlying the mistaken conclusion (cf.
Falk & Greenbaum, 1995) apparently goes as fol-
lows: If the null hypothesis is true, data of a certain
sort are unlikely. I obtained data of that sort, so
therefore the null hypothesis is unlikely to be true.
That so-called logic is precisely the same as the fol-
lowing: If the next person I meet is an American, he
or she is unlikely to be the President. I just met the
President. Therefore, he or she is unlikely to be an
American.The fundamental misunderstanding of what ap
value is leads directly to the more serious problem
of assuming that it indicates something quantitative
about the reliability, that is, the likelihood of repli-
cation, of the finding. A common misunderstanding
(see Haller & Krauss, 2002, and Oakes, 1986, for
evidence) is that apvalue, for example of .01, is the
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
3/25
Generality and Generalization of Research Findings
153
complement of the probability of replication should
the experiment be repeated. That is, the mistaken
assumption is that if one conducted the experiment
100 times, one should replicate the result on 99 of
those occasions (at least on average). If one knew
that the null hypotheses were true, then that would
be a correct interpretation of thepvalue. Of course,though, one does not know whether H0is true
(again, one usually hopes it is not). In fact, one con-
ducts the statistical test so that one can make what
one (mistakenly) hopes is an educated guess about
whether it is true. Thus, to say on the basis of a
smallpvalue that a result is statistically reliable is
to strain the meaning of reliablebeyond reasonable
limits.
This limitation of statistical significance is not
based on technical details of the null hypothesis.
That is, the problem does not lie with whether theunderlying distribution is formally normal or near
normal or whether the statistical test involved is
demonstrably robust with respect to violations of
assumptions about the underlying distribution. The
limitation is based in the logic of the approach. All
the assumptions about the distributional character-
istic null hypothesis might in fact be true, but that is
not relevant when one is speaking of what apvalue
indicates.
A major limitation of statistical significance,
therefore, is that it does not provide direct informa-tion about the reliability of research findings. With-
out knowledge about reliability, no examination of
generality can occur because repeatability is the
most basic test of generality. Notwithstanding that
limitation, however, significance testing based on
group means may be seen, incorrectly, to have
implications for generality of findings across sub-
jects. Adherence to this view unfortunately gains
strength as sample size increases. In fact, however,
regardless of sample size, no information about
intersubject generality can be extracted from asignificance statement because no knowledge is
afforded concerning the number of subjects for
whom the effect actually occurred. We examine
the implications of this fact in more detail below.
Aside from the limits surrounding reliability just
described, other characteristics of group-mean data
warrant examination as we move into a discussion
of generality. It is here that we show that psychol-
ogy, presumably because of the widespread use of
significance testing, has developed two distinguish-
able subject matters.
Significance Testing and Generality
Traditional significance testing approaches in psy-chology are generally based on data averaged across
individuals. As is well known, the mean from a
group of individuals (a sample) provides an estimate
of the mean of the entire population from which the
sample is drawn, and that estimate can be bounded
by confidence intervals that provide information
(not the probability, however, that the population
mean falls within the interval; see Smithson, 2003)
about how confident one can be that the population
mean lies within such intervals. Thus, the sample
mean provides information about a parameter thatapplies to the entire population. That fact appears to
imply substantial generality; it applies to the entire
population (however delimited), so generality
appears maximized. This raises two important issues.
First is the question of representativeness of the
means, both sample and population. That is, identi-
cal or similar means can result from substantially
different distributions of scores. Two examples that
illustrate this fact are given in Figures 7.1 and 7.2.
In Figure 7.1, four distributions of 20 scores are
arrayed horizontally in the upper panel. In the toprow, the values are arithmetically separated, whereas
in the other three, they are clustered in various
ways. Note that none of the four is particularly nor-
mal in appearance, that is, clustered in the middle.
The four plots in the lower panel showwith the
top plot corresponding to the top distribution in the
upper panel, and so onthe means (solid points)
and standard deviations (bars) of the four distribu-
tions. They are, as planned, identical. These data
show that identical means and standard deviations,
the stock in trade of inferential statistics, can beobtained from very different distributions of values.
That is, in these cases the means and standard devia-
tions do not provide a particularly informative or
representative indication of what the individual val-
ues are, which implies that when dealing with aver-
ages of measures, or averages across individuals,
attention must be paid to the representativeness of
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
4/25
Branch and Pennypacker
154
FIGURE 7.1. Upper panel: Four distributions of values,with each symbol representing one value on thex-axis. Lower
panel: The corresponding means and standard deviations ofthe four corresponding distributions from the upper panel.From The Elements of Graphing Data(rev. ed., p. 215), byW. S. Cleveland, 1994, Summit, NJ: Hobart Press. Copyright1994 by AT&T Bell Laboratories. Reprinted with permission.
X
0 5 10 15 20
0
2
4
6
8
10
12
14
X
0 5 10 15 20
0
2
4
6
8
10
12
14
X
0 5 10 15 20
0
2
4
6
8
10
12
14
X
0 5 10 15 20
Y
Y
Y
Y
0
2
4
6
8
10
12
14
FIGURE 7.2. Anscombes quartet. Each of the four graphs shows 11xypairs and the best-fitting (least-squares estimate) straight line through the points. The slopes and intercepts of thelines are identical. From Graphs in Statistical Analysis, by F. J. Anscombe, 1973,AmericanStatistician, 27,pp. 1920. Copyright 1973 by the American Statistical Association. Adaptedwith permission. All rights reserved.
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
5/25
Generality and Generalization of Research Findings
155
the mean, not just its value, or even its standard
deviation. Figure 7.2, which contains what is known
asAnscombes quartet(Anscombe, 1973), provides
an even more dramatic illustration of how focusing
only on the average of a set of numbers can lead one
to miss important features of that set. The four
graphs in Figure 7.2 plot 11 values inxycoordi-nates and show the best-fitting (via the least-squares
method) straight line to the data. Obviously, the dis-
tributions of points are quite different in the four
sets. Nevertheless, the means for thexvalues are all
the same, as are their standard deviations. The same
is true for theyvalues (yielding eight instances of
the sort shown in Figure 7.1). In addition, the slopes
and intercepts of the straight lines are identical for
all four sets, as are the sums of squared errors and
sums of squared residuals. Thus, all four would
yield the same correlation coefficient describing therelation betweenxandy.
The point of these illustrations is to indicate
that a sample mean, even though a predictor of a
population mean, is not necessarily a good descrip-
tion of individual values, so it is not necessarily a
good indicator of the generality across individual
measures. When the measures come from individ-
ual people (or other nonhuman animals), it follows
that the average of the group may not reveal, and
may well conceal, much about individuals. It is
important to remember, therefore, that samplemeans from a group of individuals permit infer-
ences about the population average, but these
means do not permit inferences to individuals
unless it is demonstrated that the mean is, in fact,
representative of individuals. Surprisingly, it is rare
in psychology to see the issue of representativeness
of an average even mentioned, although recently,
in the domain of randomized clinical trials, the
limitations attendant to group averages have been
gaining increased mention (e.g., Penston, 2005;
Williams, 2010).Many experimental designs, nevertheless, involve
comparison across groups with large numbers of
subjects, which raises the question of the practical-
ity of presenting the data of every individual. The
concern is legitimate, but the problem is not solved
by resorting to the study of group averages only.
Excellent techniques for comparing distributions,
like stem-and-leaf plots, box plots, and quantile
quantile plots, are available (Cleveland, 1994;
Tukey, 1977). They provide a more complete
description of measures from individuals, or a useful
subset (as can be the case with quantilequantile
plots), than do simple means and standard errors or
means and confidence intervals. We presume thatas null-hypothesis significance-testing approaches
become less prevalent, more effort will be directed
toward developing new and better techniques for
comparing distributions, methods that will include
and make evident the measures from individuals.
Two Separable Subject Mattersfor Psychology?In some instances, the difference between a popula-
tion parameter, such as the population average, and
the activity of an individual is obvious. For example,consider the average rate of pregnancy in women
between 20 and 30 years old. Suppose that rate is
7%. That, of course, is a useful statistic and can be
used to predict how many women in that age cate-
gory will be pregnant. More important for the pres-
ent purposes, however, is that the value, 7%, applies
to no individual woman. That is, no woman is 7%
pregnant. A woman is either pregnant or she is not.
What of situations, however, in which an average
is representative of the behavior of individuals? For
example, suppose that a particular teaching tech-nique is discovered to result in a 10% increase in
performance on some examination and that the
improvement is at or near 10% for every individual.
Is that not a case in which a group average would
permit estimation of a population mean that is, in
fact, a good descriptor of the effect of the training
for individuals and, because it applies to the popula-
tion, has wide generality? The answer is yes and no.
The point to be made here is somewhat subtle,
and so we elaborate on it with an example. Consider
a situation in which a scientist is trying to determinethe relation between amount of practice at solving
five-letter anagrams and subsequent speed at solving
six-letter anagrams. Suppose, specifically, that no
practice and 10, 50, 100, and 200 anagrams of prac-
tice are to be compared. After the practice, subjects
who have never previously solved anagrams, except
for those seen in the practice phase, are given 50 new
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
6/25
Branch and Pennypacker
156
anagrams to solve, and the time to complete is
recorded. Because total practice might be a determi-
nant of speed, the scientist opts to use a between-
groups design, with each group being exposed to
one of the practice regimens. That is, the hope is to
extract the seemingly pure relation between practice
and later speed, uncontaminated by prior relevantpractice. The scientist then averages the data from
each group and uses those means to describe the
function relating amount of practice to speed of solv-
ing the new, more difficult anagrams. In an actual
case, variability would likely be found among indi-
viduals within each group, so one issue would be
how representative the average is of each member
of each group. For our example, however, assume
that the average is representative, even perfectly so
(i.e., every subject in a group gives exactly the same
value). The scientist has generated a function, proba-bly one that describes an increase in speed of solving
anagrams as a function of amount of prior practice.
In our example, that function allows us to predict
exactly what an individual would do if exposed to a
certain amount of practice. Even though the means
for each group are representative and therefore per-
mit prediction about individual behavior, the impor-
tant point is that the function has no meaning for
an individual. That is, the function does not describe
something that would occur for an individual
because no individual can be exposed to differentamounts of practice for the first time. The function
is an actuarial account, not a description of a
behavioral process. It is, of course, to the extent
that the means are representative, a useful finding.
It is just not descriptive of a behavioral process in
an individual. To examine the same issue at the
level of an individual would require investigation
of sequences of amounts of practice, and that
examination would have to include experiments
that factor in the role of repeated practice. Obvi-
ously, such an endeavor is considerably more com-plicated than the study that generated the actuarial
curve, but it is the only way to develop a science of
individual behavior. The ontogenetic roots of
behavior cumulate over lifetimes. In later portions
of this chapter, we discuss how the complications
may be confronted.
The point is not to diminish the value of actuar-
ial data, nor to suggest that psychologists abandon
the collection and analysis of such data. If means are
highly representative, such data can offer predic-
tions at the individual subject level. Even if the
means are not highly representative, organizations
such as insurance companies and governments canand do make important use of such information in
determining appropriate shared risk or regulatory
policy, respectively. The point is, using insurance
rates as an example, that just because you are in
a particular group, for example, that of drivers
between the ages of 16 and 25, for which the mean
rate of accidents is higher than for another group,
does not indicate that you personally are more likely
to have an automobile accident. It does mean, how-
ever, that for the insurance company to remain prof-
itable, insurance rates need to be higher for allmembers of the group. Similarly, with respect to
health policy, even though most people who smoke
cigarettes do not get lung cancer, the incidence of
lung cancer, on a relative basis, is substantially
greater, on average, in that group. Because the group
is large, even a low incidence rate yields a substan-
tial number of actual lung cancer cases, so it is in
the governments, and the populations, interest to
reduce the number of people who smoke cigarettes.
The crux of the matter is that actuarial and
behavioral data, although related in that the formerdepend on the latter, are distinguishable and, there-
fore, should be distinguished. Psychology, to the
extent that it relies solely on the methods of inferen-
tial statistics that use averages across individuals,
becomes an actuarial science, not a science of behav-
ioral processes. The methods described in this
chapter are aimed at including in psychology its
oft-stated goal of being a science of behavior (or of
the mind). Behavioral and inferred mental processes
really make sense only at the level of the individual.
(The same is true of physiology, which has become arather exact science in part because of the influence
of Claude Bernard, 1865/1957.) A persons behavior,
including thinking, imagining, and so forth, is par-
ticular to that person. That is, people do not share
their minds or their behavior with others, just as
they do not share their physiology. A counterargument
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
7/25
Generality and Generalization of Research Findings
157
is that behavior and mental activity are too variable
from individual to individual to permit a scientific
analysis. We based this chapter on the more opti-
mistic view that such activity is amenable to study at
the level of the individual. Because a good deal of
application of psychological knowledge involves
dealing with individuals, for example, as in psycho-therapy, understanding at the level of the individual
remains a worthy goal. Support for the viewpoint
that a science of individual behavior is possible,
however, requires an elaboration of how an individ-
ual subjectbased analysis can yield information that
is more generally applicable to more than one or a
few individuals.
Why Single-Case Designs Do NotMean That N=1
Traditional approaches, with the attendant limita-tions described thus far, likely arose, at least in part,
because of a legitimate concern about focusing
research on individual subjects who are studied
repeatedly through time (more on this later). Such
research is usually performed with relatively few
subjects, leaving open the possibility that effects
seen might be limited with respect to generality
across other individuals. An example, modeled after
one offered by Sidman (1960), provides a response
to such misgivings. Suppose we were interested in
whether listening to classical music while solvingarithmetic problems improves accuracy. Using a
single-case approach, the study is started with a single
subject. We might first establish a baseline of accu-
racy (more on this later) by measuring it over several
successive exposures. Next, we would test the sub-
ject with the music present and then with it absent.
Suppose we find that accuracy is increased when
music is present and reverts to normal when it is
not. Suppose also that unbeknownst to us, the
effect music will have depends on the baseline level
of accuracy; if accuracy is initially low, it isenhanced by the presence of music, whereas if it is
initially high, it is reduced when the music is on.
We might mistakenly conclude, on the basis of the
results from the one subject, that music increases
accuracy of solving the kinds of arithmetic prob-
lems used.
Let us compare how a more traditional between-
groups approach might fare in dealing with the
issue. We apply music to one group and not to
another. What will result will depend on the distri-
bution of baseline accuracy across individuals.
Figure 7.3 shows three possible population distribu-
tions. In B, most people have low accuracy, in Cmost have high accuracy, and in A people fall into
two groups with respect to baseline accuracy. If one
performed the experiment on groups and took the
group mean to be the indicator of the effect of the
independent variable, the conclusion would depend
on the underlying distribution. In A, the conclusion
FIGURE 7.3. Three hypothetical frequency distribu-tions characterizing the number of people display-ing different baseline rates. From Tactics of ScientificResearch: Evaluating Experimental Data in Psychology(p. 149), by M. Sidman, 1960, New York, NY: BasicBooks. Copyright 1988 by Murray Sidman. Reprintedwith permission.
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
8/25
Branch and Pennypacker
158
might well be that music has no effect, with the low-
ered accuracy in people with high baseline accuracy
canceling out the increases that result among those
with low baseline accuracy. If the population is dis-
tributed as in B, the conclusion would be that music
increases accuracy because the mean would move in
the direction of improved accuracy. The importantpoint is that simply considering the group average
makes it less likely that the baseline dependency that
underlies the effect will be seen.
Let us now compare what might transpire with
the single-case approach, an approach based on rep-
lication. Having seen the effect in the first subject,
we recruit a second and do the experiment again.
Suppose that the population distribution is as
depicted in Figure 7.3B. The most likely scenario is
that the second subject will also have low baseline
accuracy because someone sampled from the popu-lation is most likely to manifest modal characteris-
tics. We get the same result and could, mistakenly,
conclude that music enhances arithmetic accuracy.
That is, we make the same mistake as with the
group-average approach. The difference between the
two approaches, however, is that the group mean
approach makes it more difficult to discover the
underlying, real effect. The single-case approach,
however, if enough replications are done, will even-
tually and inevitably reveal the problem because
sooner or later someone with high baseline accuracywill be examined and show a decrease. A key phrase
in the previous sentence is if enough replications
are done. Whether that happens is likely to depend
on the perceived importance of the effect. If it is
deemed important, it is likely to be subjected to
additional research, which will, in turn, lead to addi-
tional replications. Thus, the single-case approach is
not some sort of panacea with respect to identifying
such relations, but it offers a direct path to correc-
tive action. Of course, it is possible to ferret out the
baseline dependency using a group-mean approach,but that will happen only if attention is paid to the
data of individual subjects in a group. In the single-
case approach, those data are automatically scruti-
nized. A major point is that single casedoes not
necessarily imply that only one or even only a few
subjects be examined. Some research questions
might involve examination of many subjects. (We
discuss later how to decide how many subjects to
test.) What the approach involves is studying each
subject essentially as an independent experiment.
Generality across subjects is therefore examined
directly by seeing how often the experiments effects
are replicated. A second major point is that the
apparent virtues of studying many subjects, a stan-dard aspect of traditional research designs in psy-
chology, are realized only if the data from each
subject are individually analyzed.
Null-Hypothesis Significance Testingand Theory DevelopmentA major goal in any science is the development of
theory, and there is a sense in which theory has clear
relevance to generality. Effective theories are those
that account for a wide array of research results. That
is, they apply generally. The way in which signifi-cance testing is most commonly used in psychology,
however, mitigates against the orderly development
and testing of theories and against the analysis of
competing theories. The problem was first identified
as a paradox by Meehl (1967; see also Meehl, 1978).
The problem is a logical one based largely on the
choice of the null hypothesis as no effect. The logic
of the common approach is as follows. An investiga-
tor has a hypothesis that imposition of a variable,X,
will change another measure, Y.This hypothesis is
sometimes called the alternative hypothesis.Thenull hypothesis is then chosen to be thatXwill not
change Y,that is, that it will be without effect. Next,
theXcondition is imposed, and Yis measured. A
comparison is then made of YwithoutXand Ywith
X.A statistic is then calculated that is generally a
ratio of changes in Yas a result ofXover changes in
Yas a result of anything else. In more technical
terms, the statistic is effect variance over error vari-
ance. The larger the statistic, the smaller thepvalue,
and the more likely it is that statistical significance is
achieved and the null hypothesis rejected. Standardteaching demands that even though one can decide
to reject the null hypothesis, logic prevents one from
accepting the alternative hypothesis. Instead, one
would say that if the null hypothesis is rejected, the
alternative hypothesis gains support.
The paradox noted by Meehl (1967) arises
from the nature of the statistic itself. The size of the
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
9/25
Generality and Generalization of Research Findings
159
statistic is controlled by two values, the effect size
and the error variance, so it can be increased in two
ways. The way of interest for this discussion is via
a decrease in error variance, the denominator. A
major way of decreasing error variance is through
increased experimental rigor (one avenue of which
is to increase the number of subjects). To the degreethat extraneous variables (the anything else men-
tioned earlier) can be eliminated or held constant,
error variance should decrease, making it more
likely that that statistic will be large enough to war-
rant a decision as to statistical significance. The
paradox, therefore, is that as experimental rigor is
increasedthat is, as experimental techniques are
refined and improvedstatistical significance
becomes more likely, with the consequence that the
alternative hypothesis gains support, no matter what
the alternative hypothesis is. That does not seemlike a recipe for cumulative progress in science. Sim-
ple null-hypothesis significance testing with the null
hypothesis set at no effect cannot, by itself, help to
develop theory.
Meehl (1967) described one approach that can
obviate this paradox, which is to use significance
testing with a null hypothesis that is not no effect.
Instead, the null hypothesis is what the theory (or
alternative hypothesis) predicts. Consider how the
logic operates when this tactic is used. As experi-
mental rigor increases, error variance is decreased,making it more likely that the resulting statistic will
reach a critical value. When that value is achieved,
the null hypothesis is rejected, but in this case it is
the investigators theory that is rejected. Rather than
increased experimental rigor resulting in its being
easier for ones theory to gain support, it results in
its being easier to reject ones theory. Increasing
experimental control puts the theory to a more rig-
orous test, not an easier one as is the case when
using the no-effect, or no-difference, null hypothe-
sis. The harder one works to reject a theory and failsto succeed, the more confidence one has in the
theory.
Training in statistical inference, at least for psy-
chologists, does not usually emphasize that the null
hypothesis need not be no effect. It can, neverthe-
less, as just noted, be some particular effect. Note
that it has to be some specific value other than zero.
The use of a particular value as the null hypothesis
therefore requires that ones theory be quantitative
enough to generate a specific value. This approach is
what characterizes tests of goodness of fit (those that
use significance tests) of quantitatively specified
functions.
This approach of setting the null hypothesis ata value predicted by theory is nevertheless not
immune to the previously described weaknesses of
significance testing in general. If, however, signifi-
cance testing is used to make decisions, at least this
latter approach does not suffer from the weakness of
making it easier to support a researchers theory,
regardless of what it is, as methods improve.
In this section of the chapter, we have made the
case, we hope, that commonly used psychology
research methods have limitations in assessing reli-
ability and generality of research findings. In addi-tion, the methods have resulted in many areas of
psychology being largely actuarial, group-average
focused science rather than aimed at the behavior of
individuals. In the next section, we describe the
basics of an alternative approach that is based on
replication rather than significance testing and
group averages. It is useful to remember that impor-
tant science was conducted before the invention of
significance testing, and what follows is a descrip-
tion of the application of methods used to establish
most of modern physics and chemistry (and physiol-ogy) to the study of behavior. The approach focuses
on understanding behavioral processes, rather than
actuarial ones, and has already yielded a good deal
of success, as other chapters in Volume 2 of this
handbook illustrate. We should note, nevertheless,
that even if the goal is actuarial prediction and influ-
ence, the methods of statistical inference are limited
in what they can achieve with respect to reliability
of research findings. As we argue, the only sure way
to examine reliability of results is to repeat them, so
replication is the key strategy for both subject mat-ters of psychology.
ASSESSING RELIABILITY AND
GENERALITY VIA REPLICATION
The two distinguishable categories of replication are
direct replication and systematic replication,
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
10/25
Branch and Pennypacker
160
although, as we show, the distinction is not a sharp
one. Most researchers are familiar with the concept
of direct replication, which refers to repeating an
experiment as exactly as possible. If the results are
the same or similar enough, the initial effect is said
to be replicated. Direct replication, therefore, is
mainly used to assess the reliability of a researchfinding, but as we show, there is a sense in which it
also provides information about generality. System-
atic replication is the designation for a repetition
of the experiment with something altered to see
whether the effect can be observed in changed cir-
cumstances. If the results are replicated, then the
generality of the finding is extended to the new cir-
cumstances. Many varieties of systematic replication
exist, and it is the strategy most relevant to examin-
ing the generality of research findings.
Direct Replication: Within-SubjectReliability and BaselinesIn the first part of this section, we describe the
methods and roles of direct replication with the
same experimental subject (i.e., a truly single-case
experiment). We open with this simplest case, and
with an example, not only to illustrate how the strat-
egy can be used, but also to deal more clearly with
reservations about and limitations of the approach
as well as how decisions about characteristics of the
replicative process may be made.For our example, suppose that we want to mea-
sure the amount of a certain kind of food eaten after
some period without food. We let our subject eat
after 12 hours of fasting; suppose that she eats
250 grams. Direct replication of this observation
would require that we do the same test again. One
possible, but unlikely, result would be that she
would eat 250 grams again, providing an exact repli-
cation. The amount eaten would more likely be
slightly different, say 245 grams. We might then
conduct another replication to see whether the trendtoward eating less was replicable. Suppose on that
occasion our subject eats 257 grams, making it less
likely that there is a trend toward less ingestion with
successive tests. We could repeat the process again
and again. By repeatedly observing the amount eaten
after a 12-hour fast, we gain more confidence with
each successive measurement about how much our
subject will eat of that particular food after 12 hours
of not eating.
One thing that direct replication can provide, via
a sequence of direct, intrasubject replications such
as that just described, is a baseline. The left segment
of Figure 7.4 shows that there appears to be a steady
baseline amount of intake in our example. A ques-tion that might arise is how many observations are
needed to establish a baseline, that is, to come up
with a convincing assessment? The answer is that it
depends. There is no rule or convention about how
many replications are needed to render an outcome
considered reliable in the eyes of the scientific com-
munity. One factor of importance is how much is
already known. In some of the more advanced
physical sciences, a single replication (usually by a
different research team) might be adequate. In our
example, the researcher might have conducted simi-lar research previously, discovered that the baseline
value does not change after 10 observations, and
thus deemed 10 replications enough. The researcher
who chooses replication as a strategy to determine
reliability of findings, therefore, does not have the
comfort of a set of conventions (akin to those avail-
able to investigators who use conventional levels
of statistical significance) to decide whether to
conclude if an effect is reliable enough to warrant
reporting to the scientific community. Instead, the
investigators judgment plays a role, and his or herscientific reputation is dependent to some degree on
Successive Tests
0 2 4 6 8 10 12 14 16 18 20 22 24 26
GramsEaten
0
50
100
150
200
250
300
Baseline - Food 1 Food 2 Food 1
FIGURE 7.4. Hypothetical data from a series of obser-vations of eating. The first 10 points and last six pointsare amounts eaten of Food 1. The middle six points areamounts eaten of Food 2.
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
11/25
Generality and Generalization of Research Findings
161
that judgment. One of the comforts of a set of con-
ventions is that if a researcher abides by them and
results are later discovered, via failed attempts at
replication, not to be reliable, that researchers repu-
tation suffers little. In contrast, one can argue that
there are both advantages and disadvantages to rely-
ing on replication. Important advantages are havingthe benefit of informed judgment, especially of a
seasoned investigator, and the fact that social pres-
sure rides more directly on the researchers reputa-
tion. The disadvantage comes from the lack of an
agreed-on set of conventions. Principled arguments
about which is better for science can be made for
both positions, but we favor the view that science,
as a socialbehavioral activity, will fare better, or at
least no worse, if researchers are held more account-
able for their conclusions about reliability and
generality than for their adherence to a set of arbi-trary, often misunderstood conventions.
Returning to the role of a baseline construed as a
set of intrasubject replications, such baselines can
serve as bases of comparison for effects of experi-
mental changes. For example, after establishing a
baseline of eating of the first food, we could change
what the food is, perhaps less tasty or more or less
calorie laden. The second set of points in Figure 7.4,
which in essence depict measures from a second
set of replications, have been chosen to indicate a
decrease. The reliability of the effect is illustrated bythe successive similarity of values, and judgments
about how many replications are needed would
be based on the same sorts of considerations as
involved in the original baseline. A usual check
would involve return to the original food, and the
third set of points indicates a likely result, once
again with a series of replications. The overall exper-
iment, therefore, is an example of the ubiquitous
A-B-A design (see Chapter 1, this volume).
Replication, of course, need not refer only to a
series of successive measurements under identicalconditions to produce a baseline. If the type of find-
ing summarized in Figure 7.4 were especially coun-
terintuitive or at considerable odds with existing
knowledge, one might well repeat the entire project,
Food 1 to Food 2 to Food 1, and that, too, would
constitute a direct intrasubject replication. In fact,
the entire project could be carried out multiple
times if, in the investigators judgment, such confir-
mation was necessary. Each successful replication
increases confidence that the independent variable,
change of food type, is responsible for the change
in eating.
Direct Replication: Between-SubjectsReliability and GeneralityAfter all this work, an immediate limitation is that
the findings, so far as we know, may well apply
only to the one person studied. Our first result is
based on intrasubject replication. If the goal of the
research was to see whether the change in food can
influence eating, then it may be the case that no fur-
ther replication is needed. It is likely, however, that
our interest extends beyond what is possible to what
is common. In that case, additional replication is in
order, which brings us to the next type of direct rep-lication, replication with different subjects, or inter-
subject replication. Intersubject replication is used to
examine generality, in this case across subjects, and
in this single-case design Nis extended to more
than 1. Intersubject replication makes clear the fuzz-
iness of the distinction between direct and system-
atic replication. The latter is generally defined as a
replication with something changed (see below),
and a new subject is certainly a change. We also sug-
gest that systematic replication is a main strategy for
assessing generality, and by studying a second sub-ject, generality across individuals is on trial. It is
even possible to suggest that most replications, even
intrasubject replications, are, in fact, systematic. For
example, in the intrasubject replication described
above, time is different for successive observations,
and the subject brings a different history to each
observation period. It nevertheless has become stan-
dard to characterize replications in which the proce-
dures are essentially the same as direct replications.
As we outline shortly, systematic replications are
characterized by changes in procedure or conditionsthat can be quite substantial.
As noted in the section Significance Testing and
Generality earlier in this chapter, an emphasis on
replication with individual subjects approaches the
issue of subject generality by increasing the number
of subjects studied. Suppose, for the sake of our
example, we study a second subject, performing the
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
12/25
Branch and Pennypacker
162
entire experiment, baseline, new food, baseline, and
the whole sequence, over again. There are two major
classes of outcomes. One, we get the same effect.
Two, we do not. Let us deal initially with the former
possibility. The first issue is what we would accept
as same. The second persons baseline level would
likely not be exactly the same, and in fact, it mightbe notably different, averaging, say, 200 grams.
Should we count that as a failure to replicate? The
answer is (again), it depends. If our major concern
was the exact amount eaten and the factors contrib-
uting to that, then the result might well be consid-
ered a failure to replicate. We will hold off for a bit,
however, on what to do in the face of such failures,
and move forward on the assumption that we are
not so much concerned with the exact amount eaten
as with whether the change in food results in a
change in amount eaten. In that case, we might rep-licate, with the second subject, the whole sequence
of conditions, Food 1, Food 2, and back to Food 1.
Two possibilities exist: The results are the same as
for the first subject or they are not, and again, conse-
quently, an important issue is what is meant by
same.The results are unlikely, especially in behav-
ioral science, to be identical quantitatively, and,
in fact, if the baseline is different, the change in
intake cannot be identical in both absolute and rela-
tive terms, so we are left to decide whether to focus
on what is different or on what is similar. In thisstage of the discussion, let us assume that intake
decreased, as it had for the first subject. In that case,
we might feel confident that an important feature of
the data has been replicated. A next question, then,
would be whether additional replication with other
subjects is needed. In this particular example, the
answer would most likely be yes, but as is generally
the case, the real answer is that it depends on what
the goals of the experiment are.
Behavioral scientists, by and large, tend to focus
on similarities rather than differences, so if featuresof data reveal similarity across individuals, those
similarities are likely to be pursued. Consider, there-
fore, a situation in which the data for the second
subject are dissimilar, not only in quantitative terms
but in qualitative ones as well. For example, sup-
pose that for the second subject the change from
Food 1 to Food 2 results in an increase in amount
eaten rather than a decrease. Here, there is no ques-
tion that an important aspect of the first result has
not been replicated. What is to be done then? The
answer lies in the assumption of determinism that is
at the core of behavioral science. If there is a differ-
ence observed between Subject 1 and Subject 2, that
difference is the result of some other influence. Thatis, people do not differ for no reason. In fact, the
failure to replicate the exact intake levels at baseline
must also be a result of some factor. Failure to repli-
cate, therefore, is an occasion on which to initiate a
search for the variable or variables responsible for
the differences in outcomes. Suppose, for example,
that Subject 1 was female, and Subject 2 was male.
Tests with other men and women (note the expan-
sion of N) could reveal whether this factor was
important in determining the outcome. Similarly,
we have already assumed different baseline levels, soit might be the case that baseline level is related to
the direction of change in intake, a hypothesis that
can be examined by studying additional subjects. It
is interesting that examination of this second possi-
bility could be aided if the issue of different base-
lines between Subject 1 and Subject 2 had been
assumed to be a failure to replicate. In that case, we
would have focused on reasons for the difference
and may have identified factors that determine base-
line level. If that were so, it might be possible to
control the baseline levels and to change them sys-tematically, thus providing a direct method for
studying the relation between baseline level and the
effect of changing the food.
Another possible reason that disparate effects are
observed between subjects is differing sensitivity to
the particular value of the independent variable
used. In the example just described, the indepen-
dent variable was characterized qualitatively as a
change in food type, making assessment of sensitiv-
ity to it difficult to assess. If, however, the indepen-
dent variable can be characterized quantitatively, forinstance by carbohydrate content in our example,
the technique of systematic replication, elaborated
below, can be used to examine the possibility.
An important issue in considering direct replica-
tion arises when intersubject replication succeeds
but intrasubject replication does not. Taking our
example, suppose that when the conditions were
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
13/25
Generality and Generalization of Research Findings
163
changed back to Food 1 with our first subject (cf.
Figure 7.4), eating remained at the lower level,
which would prevent replication of the effect in
Subject 1. Such a result indicates either that some
variable other than the change of food was responsi-
ble for the decrease in eating or that the exposure to
Food 2 has produced a long-lasting change in eat-ing. Support for the second view can come from
attempts at intersubject replication. If experiments
with subsequent subjects reveal that a shift from
Food 1 to Food 2 results in a relatively permanent
decrease in eating, the effect is verified.
When initial effects are not recaptured after
intervening experience that produces a change, the
change is said to be irreversible. Using replication to
examine irreversible effects requires intersubject
replication, so we have here another instance in
which N=1 does not mean that only one subjectneed be studied. Many effects in psychology are irre-
versible, for example, those that we call learning, so
the individual subject approach requires that inter-
subject replication be used to assess the reliability of
such effects, and in so doing the generality of the
effect across subjects is automatically examined.
A focus on each subject individually, of course,
does not prevent the use of traditional data analysis
approaches, should an investigator be so inclined
(for inferential statistical analyses appropriate to
single-case research designs, see Chapters 11 and12, this volume). Some, for example, might want to
present group averages so that actuarial predictions
can be made. Standard techniques can be used sim-
ply by engaging in the usual sorts of data manipula-
tion. An emphasis on the data from individuals,
nevertheless, can be used to enhance the presenta-
tion. For example, consider a study by Dunn,
Sigmon, Thomas, Heil, and Higgins (2008), who
compared two conditions aimed at reducing ciga-
rette smoking. In one, vouchers were given contin-
gent on breath samples that indicated that nosmoking had occurred, whereas in the other the
vouchers were given independently of whether the
subject had smoked. Figure 7.5 shows some of the
results. The bars show group means, and the dots
show data from each individual, illustrating the
degree to which effects were replicable across
patients and the representativeness of the group
averages. Such a display of data provides consider-ably more useful information than do presentations
that include only means or results of tests of statisti-
cal significance.
Systematic Replication: ParametricExperimentsTo this point, our emphasis has been on the
intra- and intersubject generality and reliability of
effects, and we have argued that individual subject
approaches can be effectively used to assess it. Gen-
erality of effects, however, is not limited to general-
ity across individuals, and it is to other forms of
generality, culminating with scientific generality, to
which we now turn.
As noted earlier, systematic replicationrefers to
replication with something changed, and, as also
noted, a case can be made that replication with a
new subject is a form of systematic replication in
FIGURE 7.5. Number of days of continuous absti-nence from smoking cigarettes in two groups of sub-
jects. Circles are data from individuals. Open bars andbrackets show the group means and standard errors
of those means. Subjects represented by the left barreceived vouchers contingent on abstinence, whereasthose represented by the right bar received vouchersindependent of their behavior. The top bracket andasterisk indicate that the mean difference was statisti-cally significant at the .01 level. From Voucher-BasedContingent Reinforcement of Smoking AbstinenceAmong Methadone-Maintained Patients: A Pilot Study,by K. E. Dunn, S. C. Sigmon, C. S. Thomas, S. H. Heil,and S. T. Higgins, 2008,Journal of Applied BehaviorAnalysis, 41,p. 533. Copyright 2008 by the Society forthe Experimental Analysis of Behavior, Inc. Reprintedwith permission.
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
14/25
Branch and Pennypacker
164
that it is an experiment with something changed,
namely the experimental subject. From such replica-
tions come assessments of the across-subject gener-
ality of effects. In this section, we discuss other sorts
of changes between experiments that constitute sys-
tematic replication. To do so, let us begin again with
our example of effects of food type on eating. Sup-pose that after obtaining the data in Figure 7.4, we
perform a systematic replication of the study rather
than a direct repetition. For example, we might
notice that Food 2s carbohydrate content is higher
than that of Food 1. We decide, therefore, to alter
the carbohydrate content of Food 2 (and let us
assume, likely impossible, without changing the
taste) so that it matches that of Food 1, and repeat
the experiment. Such an experiment would examine
the generality of Food 2s effect on eating to a new
carbohydrate level. If adjusting Food 2s carbohy-drate amount to equal that of Food 1 resulted in the
switch in foods having no effect on eating, two
things can be concluded. One, the original result
was not replicated. In such cases, it is often wise
to replicate the original experiment to determine
whether unknown variables might have been
responsible. Two, carbohydrate amount is identified
as a likely important variable. Thus, systematic rep-
lication is not only a method for discovering gener-
ality of effects, it is also an approach that can lead to
finding controlling variables.Continuing our description of types of systematic
replication, let us assume we decide to examine
more fully the role of carbohydrates in eating. Our
original experiment may be conducted several times
but with a different carbohydrate mix for Food 2 on
each occasion. Each repetition of the experiment,
then, constitutes a systematic replication because a
new value of carbohydrate is used for each instance.
Experiments that systematically vary the value of a
variable are calledparametricexperiments, and they
play an especially important role in assessing gener-ality. Consider the data in Figure 7.6, which are
constructed to emulate what might result if several
intersubject replications of a parametric experiment
were conducted.
Parametric examination provides a number of
advantages when assessing the reliability and gener-
ality of results. First, had only a single value of the
independent variable been assessed, we might havebeen less than impressed with the degree of inter-
subject replicability of the data. The results of para-
metric examination, however, reveal a good deal of
similarity across the three subjects: All show the
same basic relation. At low percentages, the amount
eaten is roughly constant within each individual.
As the percentage increases, the amount eaten
decreases until the percentage reaches a value above
which further increases are associated with no
changes in amount eaten. Second, and this is a key
characteristic of parametric evaluation, the data sug-gest that only a range of levels of the independent
variable result in a change in behavior. That is, para-
metric experiments permit the identification of
boundary conditions, or limiting conditions, outside
of which a variable is relatively ineffective. As we
show later when dealing with the issue of scientific
generality, information about boundary conditions
can be extremely important.
Figure 7.6 also illustrates how parametric experi-
ments can help deal with the problem of lack of
intersubject replicability when a single value of anindependent variable is examined. Recalling our
original example of comparison of food types, con-
sider what could have happened if our first two sub-
jects were Subjects 1 and 3 of Figure 7.6 and Food 1
had contained 20% carbohydrate and Food 2 had
contained 25%. Changing the food type would have
produced a change for Subject 1 but not for Subject 3,
FIGURE 7.6. Hypothetical data for three subjectsshowing the relationship between carbohydrate contentand amount eaten.
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
15/25
Generality and Generalization of Research Findings
165
leading to a conclusion that we had failed to repli-
cate the food change effect across subjects. The
parametric examination, however, shows that both
subjects are similar in how food intake was influ-
enced by carbohydrate content, except that behavior
of the two subjects was sensitive in a slightly differ-
ent range. One of the most satisfying outcomes ofparametric experiments is when they reveal similari-
ties that cannot be judged when only a single value
of an independent variable is tested.
It is worth noting, too, that parametric experi-
ments can reveal that apparent intersubject replica-
bility can be misleading regarding how a variable
influences behavior. It is possible that tests with a
single value of an independent variable might lead
to very similar quantitative results for several sub-
jects, whereas a parametric analysis reveals that very
different functions describing the relation betweenthe independent variable happen to cross or come
close together at the particular value of the indepen-
dent variable evaluated.
Parametric experiments illustrate one of the
strengths of being able to characterize independent
variables quantitatively. Experiments that determine
how much of this yields how much of that provide
more information about generality than do experi-
ments that simply test whether a particular value
of an independent variable has an effect. They can
identify similarity where none is evident with a sin-gle value of an independent variable, and they can
also determine whether apparent similarity is
unrepresentative.
We should note that parametric experiments are
not limited in application to only primary indepen-
dent variables, such as that shown in our fictitious
example. Any variable associated with an experiment
can be systematically varied. As an example, the
experiment just described could be conducted under
a range of temperatures, a range of degrees of hydra-
tion of the subjects, a range of times without foodbefore the test, and any of several other variables.
Those experiments, too, would provide information
about the range of conditions under which the inde-
pendent variable of carbohydrate content exerts its
effects in the circumstances of the experiment.
Parametric experiments, although very important,
are not the only kind of systematic replications. One
other type involves using earlier findings as a starting
point, or baseline, for examination of other variables.
As an example, consider the phenomenon of false
memory in the laboratory, produced by a procedure
originally developed by Deese (1959) and later elabo-
rated by Roediger and McDermott (1995). In these
studies, subjects said they recalled or recognizedwords that were not presented. A great deal of
research followed the original demonstrations, and
these experiments varied procedural details, measure-
ment techniques, subject characteristics, and so forth.
In each instance, therefore, in which the false memory
effect was reproduced, the reliability of the phenome-
non was demonstrated and its generality extended.
Using the reproduction of previous findings as a start-
ing point for subsequent research, therefore, is a use-
ful and productive technique for examining reliability
and generality of research outcomes.Sidman (1960), in his characterization of tech-
niques of systematic replication, described a type he
called systematic replication by affirming the conse-
quent (p. 127). Essentially, this approach is very
similar to the idea of hypothesis testing because the
systematic replication is not based on simply chang-
ing some aspect of the experiment to see whether
effects can still be reproduced but rather on what the
investigator sees to be the implications of previous
results. That is, the replication may be based on the
investigators interpretation of what the data mean.For example, consider our fictitious study of the
effects of carbohydrate content on eating. That
result, and perhaps those of other experiments,
might suggest that the phenomenon is not specific
to eating. Carbohydrate ingestion possibly leads to
general lethargy or low motivation for voluntary
behavior. If we suspect that, we might devise other
experiments that could be viewed as systematic rep-
lications based on the possible implications of the
previous findings. If the results were consistent with
the lethargy interpretation, the view would gain incredence; if they were not, the view might well be
abandoned. As Sidman (1960) noted, definite con-
clusions may not be drawn from successful replica-
tions by affirming the consequent, but, as he also
noted, the approach is essential to science. The
degree to which ones confidence in an interpreta-
tion of data grows with successful replications
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
16/25
Branch and Pennypacker
166
depends on many things, not the least of which is
how counterintuitive the predicted outcome is.
Types of Generality Assessed andEstablished by Systematic Replication
Johnston and Pennypacker (2009) offered a useful
characterization of the dimensions along which gen-erality can be examined. They initially suggested a
dichotomy between generality of and generality
across. Generality acrossis simple to understand.
As we have already noted, replication can be used to
determine generality across subjects or situations, a
type of generality usually of considerable interest.
Systematic replication comes to the fore in the
assessment of generality across species and across
settings. By definition, systematic replication is an
attempt at replication with something different, so if
the species is changed, or if something (or a lot)about the setting is altered, the replication attempt is
a systematic one. In both cases, the issue of what
constitutes a successful replication may arise. Con-
sider, for example, if we decided to attempt a cross-
species replication of our experiments with food
types, and our new species was a mouse. Obviously,
mice would eat considerably less, and therefore a
precise, quantitative replication would not be possi-
ble. We might (actually, probably would), however,
argue that the replication was successful if the rela-
tion between carbohydrate content and eating wasreplicated, that is, if at low concentrations there was
little effect on eating, but as carbohydrate content
increased, the amount eaten decreased until some
level is reached above which further decreases were
not seen (cf. Figure 7.6).
What if the content values at which the decreases
begin and end differ between the species? For exam-
ple, mice may begin to show a decline when the
food reaches 15% carbohydrate, whereas with the
humans, decreases are not evident until the food
contains 25% carbohydrate. Is that a failure to repli-cate? Again, the answer is yes and no. The business
of science is to find regularities in nature, so empha-
sis is properly placed on similarities. Differences
virtually always exist, so they are easy to find. Nev-
ertheless, they cannot be ignored entirely, but their
main role is not to indicate that the similarities
evident are somehow unimportant, but rather to
promote further research into the origins of the dif-
ferences if the differences are judged to be impor-
tant. The scientist and the scientific community
make judgments about the need for further investi-
gation of the differences that are always present in
replications.
Generality ofalso plays an essential role in sci-ence. Johnston and Pennypacker (2009) described
several categories of generality of,but here we focus
on one in hopes of making the concept clear: gener-
ality of process. Our example is a behavioral process
familiar to most psychologists, specifically the pro-
cess of reinforcement of operant (purposive) behav-
ior. Reinforcementrefers to the increase in likelihood
of behavior as a result of earlier instances being fol-
lowed by certain consequences, which is the pro-
cess. Systematic replications across an immense
range of both behavioral activities and a very largerange of consequences have been shown to provide
instances of the process. For example, in addition to
the traditional lever press and key peck, activities
ranging from the electrical activity of an impercepti-
ble movement of the thumb (de Hefferline, Keenan,
& Harford, 1959), to vocal responses of chicks
(Lane, 1960), to generalized imitation in children
with developmental delays (Baer, Peterson, & Sher-
man, 1967), to the extensive range of activities
described in the use of reinforcement in the treat-
ment of behavior disorders (e.g., Martin & Pear,2007; Ullman & Krasner, 1966) have all been shown
as instances of the process. Similarly, the range of
events used as effective consequences to produce
reinforcement is also broad. Consequences such as
praise, food, intravenous drug administration, open-
ing a window, reducing a loud noise, access to exer-
cise, and many, many others have been effectively
used to produce reinforcement. All the reports
may be viewed as describing systematic replications
of the earliest experiments on the process (e.g.,
Skinner, 1932; Thorndike, 1898).This generality of process is what stands as the
rationale for speaking of reinforcement theory. The
argument is similar to that offered for the motion of
objects. Whatever those objects are, and whether
they are falling, floating, being ballistically pro-
jected, or orbiting in outer space, they can be sub-
sumed under the notion of gravitational attraction,
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
17/25
Generality and Generalization of Research Findings
167
Newtons theory of gravity. An even more dramatic
example is provided by living things. All manner of
plants and animals populate the earth, and their dif-
ferences are obvious and virtually countless. What is
less obvious but explains the variety is that all life
can be considered to have developed from the opera-
tion of three processes: variation, selection, andretention (Darwin, 1859). The sameness of cellular
architecture, including nuclear material (e.g., DNA
and RNA), also attests to the similarity. Likewise, all
the myriad instances of reinforcement suggest that
considering them instances of a single process is rea-
sonable. As noted earlier, an important goal of sci-
ence is to discover uniformities. In fact, as Duhem
(1954) noted, one of the key features of explanation
is identification of the like in the unlike. Objects
look different, are made of different substances, and
may or may not be moving in variety of ways, butthey are similar in how they are affected by gravity.
Behavioral activities take on many forms, and as just
noted, so can the consequences of those activities.
Nevertheless, they can (on many occasions) exhibit
the phenomenon known theoretically as reinforce-
ment, an instance of generality of process.
Scientific GeneralityAnother extremely important concept is scientific
generality, a type of generality that has some coun-
terintuitive characteristics. Scientific generality isimportant for at least two reasons. One, scientific
generality speaks to scientists ability to reproduce
their own findings and those of other scientists, as
well. Two, scientific generality speaks directly to
the possibility of effective application and translation
of laboratory findings to the world at large, as dis-
cussed more fully later in the last section of this
chapter. Scientific generality is defined by knowl-
edgeable reproducibility. That is, it is not character-
ized in terms of breadth of applicability, but instead
in terms of identification of factors that are requiredfor a phenomenon to occur. To illustrate the differ-
ence between scientific generality and, for example,
generality across people, consider again the fictitious
experiment on food types. Suppose that the original
experiments were all performed with male subjects.
On an attempt at replication with female subjects,
it is discovered that food type, or carbohydrate
composition, has no effect at all on eating. That, of
course, would be clear indication of a limit to the
across-subjects generality of the effect on eating. It
would, however, represent an increase in scientific
generality because it specifies more clearly the con-
ditions required to produce the phenomenon of
reduced food intake. As stated by Johnston andPennypacker (2009), A procedure can be quite
valuable even though it is effective under a narrow
range of conditions, as long as we know what those
conditions are (pp. 343344). The vital role that
systematic replication, and even failures of system-
atic replication, can play in establishing scientific
generality therefore becomes evident. Scientific gen-
erality represents an understanding of the variables
responsible for a phenomenon.
GENERALIZATION, TECHNOLOGY
TRANSFER, AND TRANSLATIONAL
RESEARCH
The function of any science is the acquisition of
basic knowledge. A secondary benefit is often the
possibility of applying that knowledge in ways that
impart benefit to some element of the culture at
large. For example, Galileos basic astronomic obser-
vations eventually led to improved navigation proce-
dures with attendant benefits to the colonial powers
of 17th-century Europe. Pasteurs discovery in 1863of the microorganisms that sour wine and turn it
into vinegar, and the observation that heat would
kill them, led eventually to the germ theory of dis-
ease and the development of vaccines.
In the case of behavior analysis, a relatively
young science, sufficient basic knowledge has been
acquired to permit vigorous attempts at application.
A discipline known as applied behavior analysis,
discussed extensively elsewhere in Volume 2 of this
handbook, is the primary result of these efforts,
although application of the findings of behavioranalysis are to be found in a variety of other disci-
plines including medicine, education, and manage-
ment, to name but a few.
In this section, we describe issues surrounding
attempts to apply laboratory research findings in the
wider world at large. Specifically, we discuss topics
related to applying research findings from controlled
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
18/25
Branch and Pennypacker
168
laboratory or therapeutic settings to new situations
or less controlled environments. First, we describe
the issue of generalization of behavioral treatment
effects from treatment settings to real-world circum-
stances. Then we outline basic general strategies
for effective transfer of technologies, taking into
account the known scientific generality of behav-ioral processes. Finally, we offer comments on the
notion of translational research, a matter of much
contemporary interest.
Generalization of ApplicationsOne of the earliest subjects of discussion that arose
with the development of behavior therapy and
behavior modification techniques was the issue
referred to as generalization (e.g., Yates, 1970). Spe-
cifically, there was concern about whether improve-
ments produced in a therapy setting would alsoappear in other, nontherapy (e.g., everyday life) sit-
uations. The term generalizationwas borrowed from
a core behavioral process discovered by experimen-
tal psychologists, that after learning to respond in a
particular way in the presence of a particular stimu-
lus, say frequency of a tone, the same behavior may
also occur in the presence of other more or less sim-
ilar stimuli, say, other frequencies of the tone. It is
an apparently simple logical step to suggest that
behavior learned in a therapy environment might
also appear in nontherapy, real-world environments,and when it does so, the result can be called general-
ization (but see Johnston, 1979, for problems with
such a simple extrapolation). Because applied
behavior analysis generally involves establishing
conditions that alter behavior, the issue of whether
those changes are restricted to the learning situa-
tions arranged or whether they become evident in
other situations is usually important. For example,
if a child who engages in aggressive behavior is
exposed to a treatment program to reduce aggres-
sion, a goal would be to decrease aggression notonly in the treatment setting but in all settings.
In a seminal article, Stokes and Baer (1977) dis-
cussed the issue of generalization of treatment
effects. A key contribution of their article was to
indicate that in general, if effects of a treatment are
to be manifested in a variety of circumstances,
achieving that outcome must be considered in
designing the intervention intended to effect the
change in behavior. That is, it is not always suffi-
cient to simply arrange circumstances that produce
a desired change in behavior in the circumscribed
environment in which the treatment is undertaken.
Instead, procedures should be used that increase the
probability that the change will be enduring andmanifested in those parts of a clients environment
in which the changes are useful. That insight has
been followed by the development of general strate-
gies to enhance the likelihood that behavior changes
occur not only in the treatment environment but
also in other appropriate ones.
For example, Miltenberger (2008) described
several general strategies that can be used to pro-
mote generalization of treatment effects. The most
direct strategy is to arrange for rewards to occur
immediately after instances of generalization occur.Such an approach essentially entails taking treat-
ment to the environments in which it is hoped the
new behavioral patterns will occur. That is, the
training environment is not explicitly delimited.
Such an approach is now widespread in applied
behavior analysis partly as a consequence of an
emphasis on analyzing reinforcement functions
before implementing treatment (see Iwata, Dorsey,
Slifer, Bauman, & Richman, 1982). This approach
to problem behavior entails discovering whether
the behavior is maintained by reinforcement, and ifit is, identifying what the reinforcers are in the
environments in which the problem behavior
occurs. Once the reinforcers responsible for the
maintenance of the problem behavior are identi-
fied, then procedures based on that knowledge are
implemented in the situations in which the behav-
ior occurs.
A related second strategy identified by Milten-
berger (2008) is consideration of the conditions
operating in the environments in which the changed
behavior would be occurring. The idea here is thatbehavior that is changed in the therapeutic setting,
for example learning effective social skills, will lead,
if performed, to more satisfying social interactions
in the nontherapy environment, and those successes
will help to solidify the gains made in the therapy
sessions. In designing the therapeutic goals, there-
fore, consideration is given to what sorts of behavior
-
5/20/2018 7 Branch&Pennypacker Generalization and Generality
19/25
Generality and Generalization of Research Findings
169
are most likely to be successful in the nontherapy
environment.
A less obvious strategy applies when the nonther-
apy environment appears to offer little or no support
for the changed behavior. An example is when ther-
apy is aimed at training an adolescent to walk away
from aggressive provocation in a schoolyard. Behav-ing in such an aggression-thwarting manner is not
likely to result in positive outcomes with peers, who
are instead likely to provide taunts and jeers after
such actions. In such a case, it may be prudent to try
to change the normal consequences in the school-
yard by having teachers or other monitors provide
positive consequences (perhaps in the form of privi-
leges, praise, etc.) for such actions when they occur.
That is, the strategy here involves altering the con-
tingencies operating in the nontherapy environment.
A fourth general strategy is to try to make thetherapy setting more like the nontherapy environ-
ment in which the changed behavior is to occur. A
study by Poche, Brouwer, and Swearingen (1981)
illustrated this approach. They taught abduction
prevention skills to preschool children, but in so
doing incorporated a relatively large number of
abduction lures in the training. The intent was that
by including a wide variety of possible lures that
might be used by would-be kidnappers, the training
would be more effective in real-world situations
than if it had not involved those variations. The gen-eral strategy in this case was to train with as many
likely relevant situations as possible. Another way to
view this strategy is that it involves incorporating
stimuli that are present in the nontherapy environ-
ment into the training.
A fifth approach is somewhat less intuitive, but
research has suggested that it may be effective. The
core idea is that if a variety of different forms of
effective behavior are established by the therapy or
training, the chance of effective behavior occurring
in the nontherapy environment is better, and as aresult the successful behavior will be supported and
continue to occur. As a simple illustration, Milten-
berger (2008) offered the example of teaching a shy
person a variety of specific ways to ask for a date,
which provides the person with several actions to
try, some of which are likely to be successful outside
of therapy.
In this section, we focused on particular strate-
gies for ensuring that desired changes in behavior
established through therapeutic methods occur and
persist in nontraining or nontherapy environments,
that is, in the everyday world. Employment of tactics
emerging from the strategies described has yielded
many successes, and the methods are part of thearmamentarium of applied behavior analysts. These
techniques to promote generalization of behavior
changes have emerged from a consideration of
fundamental behavioral processes that have been
identified and analyzed in basic research and then
subsequently validated as effective through applied
research. They represent, consequently, what can be
called successful transfer from basic science to effec-
tive technology, namely, an instance of what has
come to be called technology transfer.In the next
section, we discuss some general principles of effec-tive technology transfer.
Technology TransferPeople often use the term technologyto refer to the
body o