
DOI: 10.1037/13937-007
APA Handbook of Behavior Analysis: Vol. 1. Methods and Principles, G. J. Madden (Editor-in-Chief)
Copyright © 2013 by the American Psychological Association. All rights reserved.

CHAPTER 7

GENERALITY AND GENERALIZATION OF RESEARCH FINDINGS

Marc N. Branch and Henry S. Pennypacker

For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication. (Cohen, 1994, p. 997)

Confirmation comes from repetition. . . . Repetition is the basis for judging . . . significance and confidence. (Tukey, 1969, pp. 84–85)

As the general psychology research community becomes increasingly aware (e.g., Cohen, 1994; Loftus, 1991, 1996; Wilkinson & Task Force on Statistical Inference, 1999) of the limitations of traditional group designs and statistical inference methods with regard to assessing reliability and generality of research findings, we present an alternative approach that has been substantially developed in the branch of psychology now known as behavior analysis. In this chapter, we outline how individual subject methods, that is, so-called single-case designs, provide straightforward and, in principle, simple methods to assess the reliability and generality of research findings.

    OVERVIEW

The chapter consists of three major sections. In the first, we summarize the limitations of traditional methods, especially as they relate to assessing reliability and generality of research findings concerning behavior. We make the case that traditional methods have obscured an important distinction that has led to psychology’s consisting of two related, but separable, subject matters, behavioral science and actuarial science. We also focus on the issue of generality across individuals and how traditional methods can give the illusion of such generality. In the second major section, we discuss dimensions of generality in addition to generality across individuals. Here we define scientific generality and several other forms of generality as well. In so doing, we introduce the roles of replication, both direct and systematic, in assessing generality of research results. We argue that replication, instead of statistical inference, is an alternative primary method not only for determining the reliability of results but also for assessing and characterizing the generality of scientific findings. In the third major section, we discuss generalization of treatment effects, the fundamentals of technology transfer, and the practices that characterize translational research. There, we write of programming for and assessment of generalizability of scientific findings to applied settings. We expand our view then to the engineering issues of technology development (or technology transfer and translational research) as a capstone demonstration of generalization based on an understanding of generality of research findings.

Preparation of this chapter was supported by National Institute on Drug Abuse Grant DA004074.

LIMITATIONS OF TRADITIONAL METHODS

The traditional group-mean, statistical-inference approach to analyzing research results has faced consistent criticism for more than 4 decades (e.g., Bakan, 1966; Carver, 1978; Cohen, 1994; Gigerenzer, Krauss, & Vitouch, 2004; Loftus, 1991, 1996; Meehl, 1967, 1978; Nickerson, 2000; Rozeboom, 1960). Most of that criticism has focused on what those methods have to say about the reliability of research findings, which is appropriate because if findings are not reliable, there is no need to assess their generality. These methods, however, have also been criticized with respect to theory testing and development, issues that directly relate to generality. We treat these two categories of criticism separately.

Significance Testing and Reliability

After all of the carefully reasoned criticism of significance testing that has been published, one would hope that a clear understanding of its limits would exist among professional psychologists. That, however, appears not to be true, as noted by Cohen (1994), who lamented that

    after 4 decades of severe criticism, the ritual of null hypothesis significance testing . . . still persists. [As does] near universal misinterpretation of p as the probability that H0 is false, [and] the misinterpretation that its complement is the probability of successful replication. (p. 997)

Cohen’s assertion is supported by survey evidence revealing that a substantial majority of academic research psychologists incorrectly interpret p values and statistical significance (Haller & Krauss, 2002; Kalinowski, Fidler, & Cumming, 2008; Oakes, 1986). That a significant proportion of professional psychologists do not appreciate what statistical significance and, especially, p values represent is apparent testimony to a weakness in the training of research psychologists, a failing that lies at the feet of those of us who are engaged in teaching them. In fact, Haller and Krauss (2002) included a sample of statistical methodology instructors in their study and found that 80% of them were mistaken in their understanding of p values, so it comes as less of a surprise that the misconceptions are widespread. The following discussion, therefore, is another attempt to make clear what a p value is and what it means.

A p value, which results from a significance test, is a conditional probability. Specifically, it is the probability, if the null hypothesis is true, of obtaining data of a particular sort. That is, in algebraic symbols, it is p = P(Data|H0). The important point is that p ≠ P(H0|Data), which is what a researcher would presumably really like to know. In other words, a p value does not provide quantitative information about whether the null hypothesis is true, which is apparently widely misunderstood. Because it does not provide the oft-assumed information about the likelihood of the null hypothesis being true, a p value of .01 does not mean that the probability of the null hypothesis being true is 1 in 100. In fact, it conveys nothing quantitative about the truth of the null hypothesis. To see why, note that changing the order of conditionality in a conditional probability is crucially important. Consider such examples as P(Dead|Electrocuted) versus P(Electrocuted|Dead) or P(Cloudy|Raining) versus P(Raining|Cloudy). The first probability in each pair tells nothing about the second, just as P(Data|H0) reveals nothing about P(H0|Data). A p value, therefore, has quantitative meaning only if the null hypothesis is true, but when performing statistical tests not only does one not know whether the null hypothesis is true, one probably assumes it is not. The important fact is that a finding of statistical significance, via a small p value, does not imply that the null hypothesis is unlikely to be true. The incorrect logic underlying the mistaken conclusion (cf. Falk & Greenbaum, 1995) apparently goes as follows: If the null hypothesis is true, data of a certain sort are unlikely. I obtained data of that sort, so therefore the null hypothesis is unlikely to be true. That so-called logic is precisely the same as the following: If the next person I meet is an American, he or she is unlikely to be the President. I just met the President. Therefore, he or she is unlikely to be an American.

The fundamental misunderstanding of what a p value is leads directly to the more serious problem of assuming that it indicates something quantitative about the reliability, that is, the likelihood of replication, of the finding. A common misunderstanding (see Haller & Krauss, 2002, and Oakes, 1986, for evidence) is that a p value, for example of .01, is the complement of the probability of replication should the experiment be repeated. That is, the mistaken assumption is that if one conducted the experiment 100 times, one should replicate the result on 99 of those occasions (at least on average). If one knew that the null hypothesis were true, then that would be a correct interpretation of the p value. Of course, though, one does not know whether H0 is true (again, one usually hopes it is not). In fact, one conducts the statistical test so that one can make what one (mistakenly) hopes is an educated guess about whether it is true. Thus, to say on the basis of a small p value that a result is statistically reliable is to strain the meaning of reliable beyond reasonable limits.
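The difference between P(Data|H0) and P(H0|Data) can be made concrete with Bayes' rule. The following is only an illustrative sketch: the prior probability that the null hypothesis is true and the power of the test are hypothetical values chosen for the example (neither is supplied by a significance test), and the "data" are simplified to the event "the result was significant at the criterion alpha."

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(H0 | significant result), computed with Bayes' rule.

    prior_null: assumed P(H0) before the experiment (hypothetical input).
    alpha:      P(significant | H0 true), the significance criterion.
    power:      P(significant | H0 false) (hypothetical input).
    """
    # Total probability of a significant result under both hypotheses.
    p_significant = alpha * prior_null + power * (1.0 - prior_null)
    return alpha * prior_null / p_significant


if __name__ == "__main__":
    # Same criterion (.01) and power (.5); only the assumed prior changes.
    for prior_null in (0.1, 0.5, 0.9):
        posterior = prob_null_given_significant(prior_null, alpha=0.01, power=0.5)
        print(f"P(H0) = {prior_null:.1f} -> P(H0 | significant) = {posterior:.3f}")
```

Under these assumed inputs the probability that the null hypothesis is true after a significant result ranges from well below .01 to well above it, which illustrates why the criterion or the p value alone cannot be read as P(H0|Data).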

This limitation of statistical significance is not based on technical details of the null hypothesis. That is, the problem does not lie with whether the underlying distribution is formally normal or near normal or whether the statistical test involved is demonstrably robust with respect to violations of assumptions about the underlying distribution. The limitation is based in the logic of the approach. All the assumptions about the distributional characteristics of the null hypothesis might in fact be true, but that is not relevant when one is speaking of what a p value indicates.
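A short calculation can illustrate that the probability of replication is not the complement of p even when every distributional assumption holds exactly. The sketch below assumes a z test on normally distributed data with known variance, and it counts a replication as successful when it is again significant at the same criterion and in the same direction. The quantity `true_z` (the expected value of the test statistic, which depends on the unknown true effect) and the multipliers used are hypothetical choices for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def replication_probability(true_z, alpha=0.01):
    """P(an exact replication is significant at alpha, same direction).

    true_z is the expected value of the z statistic under the true
    (unknown) effect; the replication's z is distributed N(true_z, 1).
    """
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # two-sided criterion
    return 1.0 - STD_NORMAL.cdf(z_crit - true_z)


if __name__ == "__main__":
    # The z value that yields exactly p = .01 (two-sided).
    z_observed = STD_NORMAL.inv_cdf(1.0 - 0.01 / 2.0)
    for multiplier in (0.5, 1.0, 1.5):
        prob = replication_probability(multiplier * z_observed)
        print(f"true effect = {multiplier:.1f} x observed: "
              f"P(replication significant) = {prob:.3f}")
```

Even in the case in which the true effect happens to equal the observed one, the chance that a replication reaches the same criterion is one half, not .99, and the value changes with the unknown true effect.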

A major limitation of statistical significance, therefore, is that it does not provide direct information about the reliability of research findings. Without knowledge about reliability, no examination of generality can occur because repeatability is the most basic test of generality. Notwithstanding that limitation, however, significance testing based on group means may be seen, incorrectly, to have implications for generality of findings across subjects. Adherence to this view unfortunately gains strength as sample size increases. In fact, however, regardless of sample size, no information about intersubject generality can be extracted from a significance statement because no knowledge is afforded concerning the number of subjects for whom the effect actually occurred. We examine the implications of this fact in more detail below.

Aside from the limits surrounding reliability just described, other characteristics of group-mean data warrant examination as we move into a discussion of generality. It is here that we show that psychology, presumably because of the widespread use of significance testing, has developed two distinguishable subject matters.

    Significance Testing and Generality

Traditional significance testing approaches in psychology are generally based on data averaged across individuals. As is well known, the mean from a group of individuals (a sample) provides an estimate of the mean of the entire population from which the sample is drawn, and that estimate can be bounded by confidence intervals that provide information (not the probability, however, that the population mean falls within the interval; see Smithson, 2003) about how confident one can be that the population mean lies within such intervals. Thus, the sample mean provides information about a parameter that applies to the entire population. That fact appears to imply substantial generality; it applies to the entire population (however delimited), so generality appears maximized. This raises two important issues.

First is the question of representativeness of the means, both sample and population. That is, identical or similar means can result from substantially different distributions of scores. Two examples that illustrate this fact are given in Figures 7.1 and 7.2. In Figure 7.1, four distributions of 20 scores are arrayed horizontally in the upper panel. In the top row, the values are arithmetically separated, whereas in the other three, they are clustered in various ways. Note that none of the four is particularly normal in appearance, that is, clustered in the middle. The four plots in the lower panel show (with the top plot corresponding to the top distribution in the upper panel, and so on) the means (solid points) and standard deviations (bars) of the four distributions. They are, as planned, identical. These data show that identical means and standard deviations, the stock in trade of inferential statistics, can be obtained from very different distributions of values. That is, in these cases the means and standard deviations do not provide a particularly informative or representative indication of what the individual values are, which implies that when dealing with averages of measures, or averages across individuals, attention must be paid to the representativeness of the mean, not just its value, or even its standard deviation.

FIGURE 7.1. Upper panel: Four distributions of values, with each symbol representing one value on the x-axis. Lower panel: The corresponding means and standard deviations of the four corresponding distributions from the upper panel. From The Elements of Graphing Data (rev. ed., p. 215), by W. S. Cleveland, 1994, Summit, NJ: Hobart Press. Copyright 1994 by AT&T Bell Laboratories. Reprinted with permission.

FIGURE 7.2. Anscombe’s quartet. Each of the four graphs shows 11 x–y pairs and the best-fitting (least-squares estimate) straight line through the points. The slopes and intercepts of the lines are identical. From “Graphs in Statistical Analysis,” by F. J. Anscombe, 1973, American Statistician, 27, pp. 19–20. Copyright 1973 by the American Statistical Association. Adapted with permission. All rights reserved. [Figure not reproduced: four panels, each with X axis 0 to 20 and Y axis 0 to 14.]

Figure 7.2, which contains what is known as Anscombe’s quartet (Anscombe, 1973), provides an even more dramatic illustration of how focusing only on the average of a set of numbers can lead one to miss important features of that set. The four graphs in Figure 7.2 plot 11 values in x–y coordinates and show the best-fitting (via the least-squares method) straight line to the data. Obviously, the distributions of points are quite different in the four sets. Nevertheless, the means for the x values are all the same, as are their standard deviations. The same is true for the y values (yielding eight instances of the sort shown in Figure 7.1). In addition, the slopes and intercepts of the straight lines are identical for all four sets, as are the sums of squared errors and sums of squared residuals. Thus, all four would yield the same correlation coefficient describing the relation between x and y.
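The near-identical summary statistics of Anscombe's quartet can be checked directly. The sketch below uses the data values as they are commonly reproduced from Anscombe (1973); they were transcribed for this illustration and should be verified against the original article before being relied on. The function names are illustrative only.

```python
from statistics import mean, stdev

# Anscombe's quartet as commonly reproduced (verify against the 1973 article).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Means, standard deviations, least-squares line, and correlation."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx, "sd_x": stdev(xs),
        "mean_y": my, "sd_y": stdev(ys),
        "slope": slope, "intercept": my - slope * mx,
        "r": sxy / (sxx * syy) ** 0.5,
    }


if __name__ == "__main__":
    for name, (xs, ys) in QUARTET.items():
        stats = summarize(xs, ys)
        print(name, {key: round(value, 2) for key, value in stats.items()})
```

The four rows of output agree to about two decimal places even though the four scatterplots look nothing alike, which is why the summary numbers cannot stand in for the individual values.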

The point of these illustrations is to indicate that a sample mean, even though a predictor of a population mean, is not necessarily a good description of individual values, so it is not necessarily a good indicator of the generality across individual measures. When the measures come from individual people (or nonhuman animals), it follows that the average of the group may not reveal, and may well conceal, much about individuals. It is important to remember, therefore, that sample means from a group of individuals permit inferences about the population average, but these means do not permit inferences to individuals unless it is demonstrated that the mean is, in fact, representative of individuals. Surprisingly, it is rare in psychology to see the issue of representativeness of an average even mentioned, although recently, in the domain of randomized clinical trials, the limitations attendant to group averages have been gaining increased mention (e.g., Penston, 2005; Williams, 2010).

Many experimental designs, nevertheless, involve comparison across groups with large numbers of subjects, which raises the question of the practicality of presenting the data of every individual. The concern is legitimate, but the problem is not solved by resorting to the study of group averages only. Excellent techniques for comparing distributions, like stem-and-leaf plots, box plots, and quantile–quantile plots, are available (Cleveland, 1994; Tukey, 1977). They provide a more complete description of measures from individuals, or a useful subset (as can be the case with quantile–quantile plots), than do simple means and standard errors or means and confidence intervals. We presume that as null-hypothesis significance-testing approaches become less prevalent, more effort will be directed toward developing new and better techniques for comparing distributions, methods that will include and make evident the measures from individuals.
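As a small illustration of how a distribution-oriented summary can reveal what a mean hides, the sketch below computes the five-number summary that underlies a box plot for two groups. The scores are hypothetical values invented for the example, constructed so that both groups have the same mean.

```python
from statistics import mean, median, quantiles


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    lower_q, _, upper_q = quantiles(values, n=4, method="inclusive")
    return (min(values), lower_q, median(values), upper_q, max(values))


# Hypothetical scores for two groups of eight individuals (not real data).
GROUP_A = [48, 49, 50, 50, 50, 50, 51, 52]
GROUP_B = [20, 22, 25, 30, 70, 75, 78, 80]

if __name__ == "__main__":
    for name, scores in (("A", GROUP_A), ("B", GROUP_B)):
        print(f"Group {name}: mean = {mean(scores)}, "
              f"five-number summary = {five_number_summary(scores)}")
```

Both groups have a mean of 50, yet the summaries show that every member of Group A scored near 50 whereas no member of Group B did.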

Two Separable Subject Matters for Psychology?

In some instances, the difference between a population parameter, such as the population average, and the activity of an individual is obvious. For example, consider the average rate of pregnancy in women between 20 and 30 years old. Suppose that rate is 7%. That, of course, is a useful statistic and can be used to predict how many women in that age category will be pregnant. More important for the present purposes, however, is that the value, 7%, applies to no individual woman. That is, no woman is 7% pregnant. A woman is either pregnant or she is not.

What of situations, however, in which an average is representative of the behavior of individuals? For example, suppose that a particular teaching technique is discovered to result in a 10% increase in performance on some examination and that the improvement is at or near 10% for every individual. Is that not a case in which a group average would permit estimation of a population mean that is, in fact, a good descriptor of the effect of the training for individuals and, because it applies to the population, has wide generality? The answer is yes and no.

The point to be made here is somewhat subtle, and so we elaborate on it with an example. Consider a situation in which a scientist is trying to determine the relation between amount of practice at solving five-letter anagrams and subsequent speed at solving six-letter anagrams. Suppose, specifically, that no practice and 10, 50, 100, and 200 anagrams of practice are to be compared. After the practice, subjects who have never previously solved anagrams, except for those seen in the practice phase, are given 50 new anagrams to solve, and the time to complete is recorded. Because total practice might be a determinant of speed, the scientist opts to use a between-groups design, with each group being exposed to one of the practice regimens. That is, the hope is to extract the seemingly pure relation between practice and later speed, uncontaminated by prior relevant practice. The scientist then averages the data from each group and uses those means to describe the function relating amount of practice to speed of solving the new, more difficult anagrams. In an actual case, variability would likely be found among individuals within each group, so one issue would be how representative the average is of each member of each group. For our example, however, assume that the average is representative, even perfectly so (i.e., every subject in a group gives exactly the same value).

The scientist has generated a function, probably one that describes an increase in speed of solving anagrams as a function of amount of prior practice. In our example, that function allows us to predict exactly what an individual would do if exposed to a certain amount of practice. Even though the means for each group are representative and therefore permit prediction about individual behavior, the important point is that the function has no meaning for an individual. That is, the function does not describe something that would occur for an individual because no individual can be exposed to different amounts of practice for the first time. The function is an actuarial account, not a description of a behavioral process. It is, of course, to the extent that the means are representative, a useful finding. It is just not descriptive of a behavioral process in an individual. To examine the same issue at the level of an individual would require investigation of sequences of amounts of practice, and that examination would have to include experiments that factor in the role of repeated practice. Obviously, such an endeavor is considerably more complicated than the study that generated the actuarial curve, but it is the only way to develop a science of individual behavior. The ontogenetic roots of behavior cumulate over lifetimes. In later portions of this chapter, we discuss how the complications may be confronted.

The point is not to diminish the value of actuarial data, nor to suggest that psychologists abandon the collection and analysis of such data. If means are highly representative, such data can offer predictions at the individual subject level. Even if the means are not highly representative, organizations such as insurance companies and governments can and do make important use of such information in determining appropriate shared risk or regulatory policy, respectively. The point is, using insurance rates as an example, that membership in a particular group, for example, that of drivers between the ages of 16 and 25, for which the mean rate of accidents is higher than for another group, does not indicate that you personally are more likely to have an automobile accident. It does mean, however, that for the insurance company to remain profitable, insurance rates need to be higher for all members of the group. Similarly, with respect to health policy, even though most people who smoke cigarettes do not get lung cancer, the incidence of lung cancer, on a relative basis, is substantially greater, on average, in that group. Because the group is large, even a low incidence rate yields a substantial number of actual lung cancer cases, so it is in the government’s, and the population’s, interest to reduce the number of people who smoke cigarettes.

The crux of the matter is that actuarial and behavioral data, although related in that the former depend on the latter, are distinguishable and, therefore, should be distinguished. Psychology, to the extent that it relies solely on the methods of inferential statistics that use averages across individuals, becomes an actuarial science, not a science of behavioral processes. The methods described in this chapter are aimed at including in psychology its oft-stated goal of being a science of behavior (or of the mind). Behavioral and inferred mental processes really make sense only at the level of the individual. (The same is true of physiology, which has become a rather exact science in part because of the influence of Claude Bernard, 1865/1957.) A person’s behavior, including thinking, imagining, and so forth, is particular to that person. That is, people do not share their minds or their behavior with others, just as they do not share their physiology. A counterargument is that behavior and mental activity are too variable from individual to individual to permit a scientific analysis. We based this chapter on the more optimistic view that such activity is amenable to study at the level of the individual. Because a good deal of application of psychological knowledge involves dealing with individuals, for example, as in psychotherapy, understanding at the level of the individual remains a worthy goal. Support for the viewpoint that a science of individual behavior is possible, however, requires an elaboration of how an individual subject–based analysis can yield information that applies to more than one or a few individuals.

Why Single-Case Designs Do Not Mean That N = 1

Traditional approaches, with the attendant limitations described thus far, likely arose, at least in part, because of a legitimate concern about focusing research on individual subjects who are studied repeatedly through time (more on this later). Such research is usually performed with relatively few subjects, leaving open the possibility that effects seen might be limited with respect to generality across other individuals. An example, modeled after one offered by Sidman (1960), provides a response to such misgivings. Suppose we were interested in whether listening to classical music while solving arithmetic problems improves accuracy. Using a single-case approach, the study is started with a single subject. We might first establish a baseline of accuracy (more on this later) by measuring it over several successive exposures. Next, we would test the subject with the music present and then with it absent. Suppose we find that accuracy is increased when music is present and reverts to normal when it is not. Suppose also that unbeknownst to us, the effect music will have depends on the baseline level of accuracy; if accuracy is initially low, it is enhanced by the presence of music, whereas if it is initially high, it is reduced when the music is on. We might mistakenly conclude, on the basis of the results from the one subject, that music increases accuracy of solving the kinds of arithmetic problems used.

    Let us compare how a more traditional between-

    groups approach might fare in dealing with the

    issue. We apply music to one group and not to

    another. What will result will depend on the distri-

    bution of baseline accuracy across individuals.

    Figure 7.3 shows three possible population distribu-

    tions. In B, most people have low accuracy, in Cmost have high accuracy, and in A people fall into

    two groups with respect to baseline accuracy. If one

    performed the experiment on groups and took the

    group mean to be the indicator of the effect of the

    independent variable, the conclusion would depend

    on the underlying distribution. In A, the conclusion

    FIGURE 7.3. Three hypothetical frequency distributions characterizing the number of people displaying different baseline rates. From Tactics of Scientific Research: Evaluating Experimental Data in Psychology (p. 149), by M. Sidman, 1960, New York, NY: Basic Books. Copyright 1988 by Murray Sidman. Reprinted with permission.


    might well be that music has no effect, with the low-

    ered accuracy in people with high baseline accuracy

    canceling out the increases that result among those

    with low baseline accuracy. If the population is dis-

    tributed as in B, the conclusion would be that music

    increases accuracy because the mean would move in

    the direction of improved accuracy. The important point is that simply considering the group average

    makes it less likely that the baseline dependency that

    underlies the effect will be seen.
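
The masking described above can be illustrated with a minimal sketch. It is not part of the original example: the baseline values (30% and 70% correct), the size of the effect (10 percentage points in either direction), and the proportions assigned to populations A, B, and C are all assumptions chosen only to mimic the three hypothetical distributions.

```python
from statistics import mean

LOW_BASELINE = 30.0    # hypothetical percent correct for a low-accuracy person
HIGH_BASELINE = 70.0   # hypothetical percent correct for a high-accuracy person


def music_effect(baseline):
    """Assumed baseline-dependent effect of music, in percentage points:
    low baselines improve, high baselines get worse."""
    return 10.0 if baseline < 50.0 else -10.0


def population(kind, n=100):
    """Build n baseline accuracies for a population loosely patterned on
    panels A, B, and C of Figure 7.3 (the proportions are assumptions)."""
    if kind == "A":      # two equal clusters, one low and one high
        n_low = n // 2
    elif kind == "B":    # most people have low accuracy
        n_low = (9 * n) // 10
    elif kind == "C":    # most people have high accuracy
        n_low = n // 10
    else:
        raise ValueError(f"unknown population: {kind!r}")
    return [LOW_BASELINE] * n_low + [HIGH_BASELINE] * (n - n_low)


def summarize(kind, n=100):
    """Report the group-mean change next to counts of individual outcomes."""
    changes = [music_effect(b) for b in population(kind, n)]
    return {
        "mean_change": mean(changes),
        "n_improved": sum(c > 0 for c in changes),
        "n_worsened": sum(c < 0 for c in changes),
    }


if __name__ == "__main__":
    for kind in "ABC":
        print(kind, summarize(kind))
```

Under these assumptions, population A yields a mean change of zero even though every individual changed by 10 points, which is the sense in which the group average hides the baseline dependency; the per-person counts make it visible.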

    Let us now compare what might transpire with

    the single-case approach, an approach based on rep-

    lication. Having seen the effect in the first subject,

    we recruit a second and do the experiment again.

    Suppose that the population distribution is as

    depicted in Figure 7.3B. The most likely scenario is

    that the second subject will also have low baseline

    accuracy because someone sampled from the population is most likely to manifest modal characteristics.

    We get the same result and could, mistakenly,

    conclude that music enhances arithmetic accuracy.

    That is, we make the same mistake as with the

    group-average approach. The difference between the

    two approaches, however, is that the group mean

    approach makes it more difficult to discover the

    underlying, real effect. The single-case approach,

    however, if enough replications are done, will even-

    tually and inevitably reveal the problem because

    sooner or later someone with high baseline accuracy will be examined and show a decrease. A key phrase

    in the previous sentence is "if enough replications

    are done." Whether that happens is likely to depend

    on the perceived importance of the effect. If it is

    deemed important, it is likely to be subjected to

    additional research, which will, in turn, lead to addi-

    tional replications. Thus, the single-case approach is

    not some sort of panacea with respect to identifying

    such relations, but it offers a direct path to correc-

    tive action. Of course, it is possible to ferret out the

    baseline dependency using a group-mean approach, but that will happen only if attention is paid to the

    data of individual subjects in a group. In the single-case

    approach, those data are automatically scrutinized.

    A major point is that single case does not

    necessarily imply that only one or even only a few

    subjects be examined. Some research questions

    might involve examination of many subjects. (We

    discuss later how to decide how many subjects to

    test.) What the approach involves is studying each

    subject essentially as an independent experiment.

    Generality across subjects is therefore examined

    directly by seeing how often the experiment's effects

    are replicated. A second major point is that the

    apparent virtues of studying many subjects, a standard aspect of traditional research designs in psychology,

    are realized only if the data from each

    subject are individually analyzed.
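
How many replications count as "enough" can be given a rough quantitative feel. The sketch below is an illustration under a strong assumption that the text does not make: subjects are treated as independent random draws from a population in which a fixed proportion `p_high` has high baseline accuracy. The function names are hypothetical.

```python
def prob_exception_seen(p_high, n_subjects):
    """Probability that at least one of n independently sampled subjects
    has a high baseline, given that a proportion p_high of the population does."""
    if not 0.0 <= p_high <= 1.0:
        raise ValueError("p_high must be between 0 and 1")
    if n_subjects < 0:
        raise ValueError("n_subjects must be non-negative")
    return 1.0 - (1.0 - p_high) ** n_subjects


def subjects_needed(p_high, target=0.95):
    """Smallest number of subjects for which the chance of meeting at least
    one high-baseline subject reaches the target probability."""
    if not 0.0 < p_high <= 1.0:
        raise ValueError("p_high must be greater than 0 and at most 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    n = 1
    while prob_exception_seen(p_high, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    for p in (0.5, 0.25, 0.1):
        print(p, subjects_needed(p))
```

The rarer the exceptional subjects, the more replications are required before one is likely to be encountered, which is consistent with the point that discovery depends on the effect being deemed important enough to keep studying.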

    Null-Hypothesis Significance Testing and Theory Development

    A major goal in any science is the development of

    theory, and there is a sense in which theory has clear

    relevance to generality. Effective theories are those

    that account for a wide array of research results. That

    is, they apply generally. The way in which significance testing is most commonly used in psychology,

    however, militates against the orderly development

    and testing of theories and against the analysis of

    competing theories. The problem was first identified

    as a paradox by Meehl (1967; see also Meehl, 1978).

    The problem is a logical one based largely on the

    choice of the null hypothesis as "no effect." The logic

    of the common approach is as follows. An investigator

    has a hypothesis that imposition of a variable, X,

    will change another measure, Y. This hypothesis is

    sometimes called the alternative hypothesis. The null hypothesis is then chosen to be that X will not

    change Y, that is, that it will be without effect. Next,

    the X condition is imposed, and Y is measured. A

    comparison is then made of Y without X and Y with

    X. A statistic is then calculated that is generally a

    ratio of changes in Y as a result of X over changes in

    Y as a result of anything else. In more technical

    terms, the statistic is effect variance over error variance.

    The larger the statistic, the smaller the p value,

    and the more likely it is that statistical significance is

    achieved and the null hypothesis rejected. Standard teaching demands that even though one can decide

    to reject the null hypothesis, logic prevents one from

    accepting the alternative hypothesis. Instead, one

    would say that if the null hypothesis is rejected, the

    alternative hypothesis gains support.

    The paradox noted by Meehl (1967) arises

    from the nature of the statistic itself. The size of the

  • 5/20/2018 7 Branch&Pennypacker Generalization and Generality

    9/25

    Generality and Generalization of Research Findings

    159

    statistic is controlled by two values, the effect size

    and the error variance, so it can be increased in two

    ways. The way of interest for this discussion is via

    a decrease in error variance, the denominator. A

    major way of decreasing error variance is through

    increased experimental rigor (one avenue of which

    is to increase the number of subjects). To the degree that extraneous variables (the "anything else" mentioned

    earlier) can be eliminated or held constant,

    error variance should decrease, making it more

    likely that the statistic will be large enough to warrant

    a decision as to statistical significance. The

    paradox, therefore, is that as experimental rigor is

    increased (that is, as experimental techniques are

    refined and improved), statistical significance

    becomes more likely, with the consequence that the

    alternative hypothesis gains support, no matter what

    the alternative hypothesis is. That does not seem like a recipe for cumulative progress in science. Simple

    null-hypothesis significance testing with the null

    hypothesis set at no effect cannot, by itself, help to

    develop theory.
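
The arithmetic behind the paradox can be sketched with a two-group t statistic, which stands in here for the "effect over error" ratio described above (for two groups, the squared t equals the F ratio). The sketch assumes a small but nonzero true difference between conditions; that difference, the error standard deviations, and the group sizes are hypothetical values chosen for illustration.

```python
import math


def t_statistic(mean_diff, error_sd, n_per_group):
    """Two-group t statistic for equal group sizes and a common error SD:
    the mean difference divided by its standard error."""
    if error_sd <= 0 or n_per_group <= 0:
        raise ValueError("error_sd and n_per_group must be positive")
    standard_error = error_sd * math.sqrt(2.0 / n_per_group)
    return mean_diff / standard_error


if __name__ == "__main__":
    tiny_effect = 0.5  # hypothetical mean difference, arbitrary units
    # Each step represents greater rigor: less error variance or more subjects.
    for error_sd, n in [(10.0, 10), (5.0, 10), (5.0, 100), (1.0, 100)]:
        t = t_statistic(tiny_effect, error_sd, n)
        print(f"error SD = {error_sd:5.1f}, n per group = {n:3d}, t = {t:.2f}")
```

The effect itself never changes in this sketch; only the denominator shrinks. The statistic nonetheless grows steadily, so a sufficiently rigorous experiment would reject the no-effect null hypothesis whatever theory motivated it.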

    Meehl (1967) described one approach that can

    obviate this paradox, which is to use significance

    testing with a null hypothesis that is not "no effect."

    Instead, the null hypothesis is what the theory (or

    alternative hypothesis) predicts. Consider how the

    logic operates when this tactic is used. As experimental

    rigor increases, error variance is decreased, making it more likely that the resulting statistic will

    reach a critical value. When that value is achieved,

    the null hypothesis is rejected, but in this case it is

    the investigator's theory that is rejected. Rather than

    increased experimental rigor resulting in its being

    easier for one's theory to gain support, it results in

    its being easier to reject one's theory. Increasing

    experimental control puts the theory to a more rig-

    orous test, not an easier one as is the case when

    using the no-effect, or no-difference, null hypothesis.

    The harder one works to reject a theory and fails to succeed, the more confidence one has in the

    theory.

    Training in statistical inference, at least for psy-

    chologists, does not usually emphasize that the null

    hypothesis need not be "no effect." It can, nevertheless,

    as just noted, be some particular effect. Note

    that it has to be some specific value other than zero.

    The use of a particular value as the null hypothesis

    therefore requires that one's theory be quantitative

    enough to generate a specific value. This approach is

    what characterizes tests of goodness of fit (those that

    use significance tests) of quantitatively specified

    functions.
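
A minimal sketch of testing against a theory's point prediction follows. It uses a simple z statistic and the conventional two-sided critical value of 1.96; the predicted value, observed means, error standard deviations, and sample sizes are hypothetical and serve only to show how the logic reverses when the null hypothesis is the prediction itself.

```python
import math


def z_against_prediction(observed_mean, predicted_value, error_sd, n):
    """z statistic comparing an observed mean with a theory's point prediction."""
    if error_sd <= 0 or n <= 0:
        raise ValueError("error_sd and n must be positive")
    return (observed_mean - predicted_value) / (error_sd / math.sqrt(n))


def theory_rejected(observed_mean, predicted_value, error_sd, n, critical=1.96):
    """True when the data differ significantly from what the theory predicts."""
    z = z_against_prediction(observed_mean, predicted_value, error_sd, n)
    return abs(z) > critical


if __name__ == "__main__":
    predicted = 12.0  # hypothetical value generated by a quantitative theory
    # A loose experiment cannot detect a prediction that is off by 0.5 units.
    print(theory_rejected(12.5, predicted, error_sd=5.0, n=20))
    # A more rigorous experiment detects the same discrepancy.
    print(theory_rejected(12.5, predicted, error_sd=1.0, n=100))
    # An accurate prediction survives the rigorous test.
    print(theory_rejected(12.02, predicted, error_sd=1.0, n=100))
```

Here improved rigor works against an inaccurate theory rather than for it, and a theory that survives the tighter test has earned more confidence.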

    This approach of setting the null hypothesis at a value predicted by theory is nevertheless not

    immune to the previously described weaknesses of

    significance testing in general. If, however, signifi-

    cance testing is used to make decisions, at least this

    latter approach does not suffer from the weakness of

    making it easier to support a researcher's theory,

    regardless of what it is, as methods improve.

    In this section of the chapter, we have made the

    case, we hope, that commonly used psychology

    research methods have limitations in assessing reliability

    and generality of research findings. In addition, the methods have resulted in many areas of

    psychology being largely actuarial, group-average

    focused science rather than aimed at the behavior of

    individuals. In the next section, we describe the

    basics of an alternative approach that is based on

    replication rather than significance testing and

    group averages. It is useful to remember that impor-

    tant science was conducted before the invention of

    significance testing, and what follows is a descrip-

    tion of the application of methods used to establish

    most of modern physics and chemistry (and physiology) to the study of behavior. The approach focuses

    on understanding behavioral processes, rather than

    actuarial ones, and has already yielded a good deal

    of success, as other chapters in Volume 2 of this

    handbook illustrate. We should note, nevertheless,

    that even if the goal is actuarial prediction and influ-

    ence, the methods of statistical inference are limited

    in what they can achieve with respect to reliability

    of research findings. As we argue, the only sure way

    to examine reliability of results is to repeat them, so

    replication is the key strategy for both subject matters of psychology.

    ASSESSING RELIABILITY AND

    GENERALITY VIA REPLICATION

    The two distinguishable categories of replication are

    direct replication and systematic replication,


    although, as we show, the distinction is not a sharp

    one. Most researchers are familiar with the concept

    of direct replication, which refers to repeating an

    experiment as exactly as possible. If the results are

    the same or similar enough, the initial effect is said

    to be replicated. Direct replication, therefore, is

    mainly used to assess the reliability of a research finding, but as we show, there is a sense in which it

    also provides information about generality. System-

    atic replication is the designation for a repetition

    of the experiment with something altered to see

    whether the effect can be observed in changed cir-

    cumstances. If the results are replicated, then the

    generality of the finding is extended to the new cir-

    cumstances. Many varieties of systematic replication

    exist, and it is the strategy most relevant to examin-

    ing the generality of research findings.

    Direct Replication: Within-Subject Reliability and Baselines

    In the first part of this section, we describe the

    methods and roles of direct replication with the

    same experimental subject (i.e., a truly single-case

    experiment). We open with this simplest case, and

    with an example, not only to illustrate how the strat-

    egy can be used, but also to deal more clearly with

    reservations about and limitations of the approach

    as well as how decisions about characteristics of the

    replicative process may be made.

    For our example, suppose that we want to measure

    the amount of a certain kind of food eaten after

    some period without food. We let our subject eat

    after 12 hours of fasting; suppose that she eats

    250 grams. Direct replication of this observation

    would require that we do the same test again. One

    possible, but unlikely, result would be that she

    would eat 250 grams again, providing an exact repli-

    cation. The amount eaten would more likely be

    slightly different, say 245 grams. We might then

    conduct another replication to see whether the trend toward eating less was replicable. Suppose on that

    occasion our subject eats 257 grams, making it less

    likely that there is a trend toward less ingestion with

    successive tests. We could repeat the process again

    and again. By repeatedly observing the amount eaten

    after a 12-hour fast, we gain more confidence with

    each successive measurement about how much our

    subject will eat of that particular food after 12 hours

    of not eating.
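
The sequence of direct replications just described can be tracked with a running summary. In the sketch below, the three observations (250, 245, and 257 grams) come from the example; the stability check is a hypothetical criterion added for illustration and should not be read as an established rule.

```python
from statistics import mean


def running_summary(observations):
    """After each observation, return (count, running mean, range so far)."""
    rows = []
    for i in range(1, len(observations) + 1):
        seen = observations[:i]
        rows.append((i, mean(seen), max(seen) - min(seen)))
    return rows


def is_stable(observations, window=3, tolerance=0.05):
    """Hypothetical criterion: the last `window` observations all lie within
    a proportion `tolerance` of their own mean."""
    if len(observations) < window:
        return False
    recent = observations[-window:]
    center = mean(recent)
    return all(abs(x - center) <= tolerance * center for x in recent)


if __name__ == "__main__":
    grams_eaten = [250, 245, 257]  # values from the example in the text
    for count, avg, spread in running_summary(grams_eaten):
        print(f"after {count} tests: mean = {avg:.1f} g, range = {spread} g")
    print("stable by the assumed criterion:", is_stable(grams_eaten))
```

The window size and tolerance are exactly the kind of choices that rest on the investigator's judgment and prior knowledge of the preparation.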

    One thing that direct replication can provide, via

    a sequence of direct, intrasubject replications such

    as that just described, is a baseline. The left segment

    of Figure 7.4 shows that there appears to be a steady

    baseline amount of intake in our example. A question that might arise is how many observations are

    needed to establish a baseline, that is, to come up

    with a convincing assessment? The answer is that it

    depends. There is no rule or convention about how

    many replications are needed to render an outcome

    considered reliable in the eyes of the scientific com-

    munity. One factor of importance is how much is

    already known. In some of the more advanced

    physical sciences, a single replication (usually by a

    different research team) might be adequate. In our

    example, the researcher might have conducted similar research previously, discovered that the baseline

    value does not change after 10 observations, and

    thus deemed 10 replications enough. The researcher

    who chooses replication as a strategy to determine

    reliability of findings, therefore, does not have the

    comfort of a set of conventions (akin to those avail-

    able to investigators who use conventional levels

    of statistical significance) to decide whether

    an effect is reliable enough to warrant

    reporting to the scientific community. Instead, the

    investigator's judgment plays a role, and his or her scientific reputation is dependent to some degree on

    [Figure 7.4 plot: horizontal axis, Successive Tests (0 to 26); vertical axis, Grams Eaten (0 to 300); phases labeled Baseline - Food 1, Food 2, Food 1.]

    FIGURE 7.4. Hypothetical data from a series of observations of eating. The first 10 points and last six points are amounts eaten of Food 1. The middle six points are amounts eaten of Food 2.


    that judgment. One of the comforts of a set of con-

    ventions is that if a researcher abides by them and

    results are later discovered, via failed attempts at

    replication, not to be reliable, that researcher's reputation

    suffers little. In contrast, one can argue that

    there are both advantages and disadvantages to relying

    on replication. Important advantages are having the benefit of informed judgment, especially of a

    seasoned investigator, and the fact that social pressure

    rides more directly on the researcher's reputation.

    The disadvantage comes from the lack of an

    agreed-on set of conventions. Principled arguments

    about which is better for science can be made for

    both positions, but we favor the view that science,

    as a social-behavioral activity, will fare better, or at

    least no worse, if researchers are held more account-

    able for their conclusions about reliability and

    generality than for their adherence to a set of arbitrary, often misunderstood conventions.

    Returning to the role of a baseline construed as a

    set of intrasubject replications, such baselines can

    serve as bases of comparison for effects of experi-

    mental changes. For example, after establishing a

    baseline of eating of the first food, we could change

    what the food is, perhaps less tasty or more or less

    calorie laden. The second set of points in Figure 7.4,

    which in essence depict measures from a second

    set of replications, have been chosen to indicate a

    decrease. The reliability of the effect is illustrated by the successive similarity of values, and judgments

    about how many replications are needed would

    be based on the same sorts of considerations as

    involved in the original baseline. A usual check

    would involve return to the original food, and the

    third set of points indicates a likely result, once

    again with a series of replications. The overall exper-

    iment, therefore, is an example of the ubiquitous

    A-B-A design (see Chapter 1, this volume).
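
The A-B-A comparison can be sketched numerically. The phase lengths (10, 6, and 6 observations) follow the description of Figure 7.4, and the first three baseline values come from the running example; every other value is invented for illustration. The non-overlap rule used to call the effect replicated is likewise an assumption, one of several criteria an investigator might adopt.

```python
from statistics import mean

# Hypothetical grams eaten in each phase (see the lead-in for what is assumed).
PHASES = [
    ("A1: Food 1", [250, 245, 257, 252, 248, 251, 249, 253, 250, 247]),
    ("B:  Food 2", [182, 176, 185, 179, 181, 177]),
    ("A2: Food 1", [246, 252, 249, 255, 248, 250]),
]


def phase_summary(values):
    """Mean and extremes of one phase."""
    return {"mean": mean(values), "low": min(values), "high": max(values)}


def ranges_overlap(a, b):
    """True when the value ranges of two phases share any point."""
    return min(a) <= max(b) and min(b) <= max(a)


def aba_effect_replicated(phases):
    """Assumed criterion: phase B overlaps neither A phase, and the two
    A phases overlap each other (the baseline was recaptured)."""
    (_, a1), (_, b), (_, a2) = phases
    return (
        not ranges_overlap(a1, b)
        and not ranges_overlap(b, a2)
        and ranges_overlap(a1, a2)
    )


if __name__ == "__main__":
    for label, values in PHASES:
        print(label, phase_summary(values))
    print("effect replicated within subject:", aba_effect_replicated(PHASES))
```

Each phase is itself a series of direct replications, so the comparison rests on the consistency within phases as much as on the difference between them.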

    Replication, of course, need not refer only to a

    series of successive measurements under identical conditions to produce a baseline. If the type of finding

    summarized in Figure 7.4 were especially counterintuitive

    or at considerable odds with existing

    knowledge, one might well repeat the entire project,

    Food 1 to Food 2 to Food 1, and that, too, would

    constitute a direct intrasubject replication. In fact,

    the entire project could be carried out multiple

    times if, in the investigator's judgment, such confirmation

    was necessary. Each successful replication

    increases confidence that the independent variable,

    change of food type, is responsible for the change

    in eating.

    Direct Replication: Between-Subjects Reliability and Generality

    After all this work, an immediate limitation is that

    the findings, so far as we know, may well apply

    only to the one person studied. Our first result is

    based on intrasubject replication. If the goal of the

    research was to see whether the change in food can

    influence eating, then it may be the case that no fur-

    ther replication is needed. It is likely, however, that

    our interest extends beyond what is possible to what

    is common. In that case, additional replication is in

    order, which brings us to the next type of direct replication, replication with different subjects, or intersubject

    replication. Intersubject replication is used to

    examine generality, in this case across subjects, and

    in this single-case design N is extended to more

    than 1. Intersubject replication makes clear the fuzz-

    iness of the distinction between direct and system-

    atic replication. The latter is generally defined as a

    replication with something changed (see below),

    and a new subject is certainly a change. We also sug-

    gest that systematic replication is a main strategy for

    assessing generality, and by studying a second subject, generality across individuals is on trial. It is

    even possible to suggest that most replications, even

    intrasubject replications, are, in fact, systematic. For

    example, in the intrasubject replication described

    above, time is different for successive observations,

    and the subject brings a different history to each

    observation period. It nevertheless has become stan-

    dard to characterize replications in which the proce-

    dures are essentially the same as direct replications.

    As we outline shortly, systematic replications are

    characterized by changes in procedure or conditions that can be quite substantial.

    As noted in the section "Significance Testing and

    Generality" earlier in this chapter, an emphasis on

    replication with individual subjects approaches the

    issue of subject generality by increasing the number

    of subjects studied. Suppose, for the sake of our

    example, we study a second subject, performing the


    entire experiment, baseline, new food, baseline, and

    the whole sequence, over again. There are two major

    classes of outcomes. One, we get the same effect.

    Two, we do not. Let us deal initially with the former

    possibility. The first issue is what we would accept

    as "same." The second person's baseline level would

    likely not be exactly the same, and in fact, it might be notably different, averaging, say, 200 grams.

    Should we count that as a failure to replicate? The

    answer is (again), it depends. If our major concern

    was the exact amount eaten and the factors contrib-

    uting to that, then the result might well be consid-

    ered a failure to replicate. We will hold off for a bit,

    however, on what to do in the face of such failures,

    and move forward on the assumption that we are

    not so much concerned with the exact amount eaten

    as with whether the change in food results in a

    change in amount eaten. In that case, we might replicate, with the second subject, the whole sequence

    of conditions, Food 1, Food 2, and back to Food 1.

    Two possibilities exist: The results are the same as

    for the first subject or they are not, and again, conse-

    quently, an important issue is what is meant by

    "same." The results are unlikely, especially in behavioral

    science, to be identical quantitatively, and,

    in fact, if the baseline is different, the change in

    intake cannot be identical in both absolute and rela-

    tive terms, so we are left to decide whether to focus

    on what is different or on what is similar. In this stage of the discussion, let us assume that intake

    decreased, as it had for the first subject. In that case,

    we might feel confident that an important feature of

    the data has been replicated. A next question, then,

    would be whether additional replication with other

    subjects is needed. In this particular example, the

    answer would most likely be yes, but as is generally

    the case, the real answer is that it depends on what

    the goals of the experiment are.
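
The claim that a change cannot match in both absolute and relative terms when baselines differ is a matter of simple arithmetic, sketched below. The baselines of 250 and 200 grams come from the example; the intake of 200 grams under Food 2 for the first subject is a hypothetical value.

```python
def change(baseline, treatment):
    """Return the (absolute, relative) change from baseline to treatment."""
    if baseline == 0:
        raise ValueError("baseline must be nonzero to compute relative change")
    absolute = treatment - baseline
    return absolute, absolute / baseline


def matched_treatment_levels(baseline_1, treatment_1, baseline_2):
    """Intake Subject 2 would need in order to match Subject 1's change in
    absolute terms and, separately, in relative terms."""
    absolute_1, relative_1 = change(baseline_1, treatment_1)
    same_absolute = baseline_2 + absolute_1
    same_relative = baseline_2 * (1 + relative_1)
    return same_absolute, same_relative


def same_direction(change_1, change_2):
    """True when two changes share a sign, the qualitative sense of replication."""
    return (change_1 > 0) == (change_2 > 0) and change_1 != 0 and change_2 != 0


if __name__ == "__main__":
    # Subject 1: baseline 250 g, hypothetical intake of 200 g with Food 2.
    print(change(250, 200))
    # Subject 2 (baseline 200 g) would need different intakes to match each way.
    print(matched_treatment_levels(250, 200, 200))
```

With these numbers, Subject 2 would have to eat 150 grams to match the absolute decrease but 160 grams to match the proportional decrease, so at most one of the two can be replicated exactly; agreement in direction is the weaker criterion that remains available.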

    Behavioral scientists, by and large, tend to focus

    on similarities rather than differences, so if features of data reveal similarity across individuals, those

    similarities are likely to be pursued. Consider, there-

    fore, a situation in which the data for the second

    subject are dissimilar, not only in quantitative terms

    but in qualitative ones as well. For example, sup-

    pose that for the second subject the change from

    Food 1 to Food 2 results in an increase in amount

    eaten rather than a decrease. Here, there is no ques-

    tion that an important aspect of the first result has

    not been replicated. What is to be done then? The

    answer lies in the assumption of determinism that is

    at the core of behavioral science. If there is a differ-

    ence observed between Subject 1 and Subject 2, that

    difference is the result of some other influence. That is, people do not differ for no reason. In fact, the

    failure to replicate the exact intake levels at baseline

    must also be a result of some factor. Failure to repli-

    cate, therefore, is an occasion on which to initiate a

    search for the variable or variables responsible for

    the differences in outcomes. Suppose, for example,

    that Subject 1 was female, and Subject 2 was male.

    Tests with other men and women (note the expan-

    sion of N) could reveal whether this factor was

    important in determining the outcome. Similarly,

    we have already assumed different baseline levels, so it might be the case that baseline level is related to

    the direction of change in intake, a hypothesis that

    can be examined by studying additional subjects. It

    is interesting that examination of this second possi-

    bility could be aided if the issue of different base-

    lines between Subject 1 and Subject 2 had been

    assumed to be a failure to replicate. In that case, we

    would have focused on reasons for the difference

    and may have identified factors that determine base-

    line level. If that were so, it might be possible to

    control the baseline levels and to change them systematically, thus providing a direct method for

    studying the relation between baseline level and the

    effect of changing the food.

    Another possible reason that disparate effects are

    observed between subjects is differing sensitivity to

    the particular value of the independent variable

    used. In the example just described, the indepen-

    dent variable was characterized qualitatively as a

    change in food type, making sensitivity

    to it difficult to assess. If, however, the independent

    variable can be characterized quantitatively, for instance by carbohydrate content in our example,

    the technique of systematic replication, elaborated

    below, can be used to examine the possibility.

    An important issue in considering direct replica-

    tion arises when intersubject replication succeeds

    but intrasubject replication does not. Taking our

    example, suppose that when the conditions were


    changed back to Food 1 with our first subject (cf.

    Figure 7.4), eating remained at the lower level,

    which would prevent replication of the effect in

    Subject 1. Such a result indicates either that some

    variable other than the change of food was responsi-

    ble for the decrease in eating or that the exposure to

    Food 2 has produced a long-lasting change in eating. Support for the second view can come from

    attempts at intersubject replication. If experiments

    with subsequent subjects reveal that a shift from

    Food 1 to Food 2 results in a relatively permanent

    decrease in eating, the effect is verified.

    When initial effects are not recaptured after

    intervening experience that produces a change, the

    change is said to be irreversible. Using replication to

    examine irreversible effects requires intersubject

    replication, so we have here another instance in

    which N = 1 does not mean that only one subject need be studied. Many effects in psychology are irreversible,

    for example, those that we call learning, so

    the individual subject approach requires that inter-

    subject replication be used to assess the reliability of

    such effects, and in so doing the generality of the

    effect across subjects is automatically examined.

    A focus on each subject individually, of course,

    does not prevent the use of traditional data analysis

    approaches, should an investigator be so inclined

    (for inferential statistical analyses appropriate to

    single-case research designs, see Chapters 11 and 12, this volume). Some, for example, might want to

    present group averages so that actuarial predictions

    can be made. Standard techniques can be used sim-

    ply by engaging in the usual sorts of data manipula-

    tion. An emphasis on the data from individuals,

    nevertheless, can be used to enhance the presenta-

    tion. For example, consider a study by Dunn,

    Sigmon, Thomas, Heil, and Higgins (2008), who

    compared two conditions aimed at reducing cigarette

    smoking. In one, vouchers were given contingent

    on breath samples that indicated that no smoking had occurred, whereas in the other the

    vouchers were given independently of whether the

    subject had smoked. Figure 7.5 shows some of the

    results. The bars show group means, and the dots

    show data from each individual, illustrating the

    degree to which effects were replicable across

    patients and the representativeness of the group

    averages. Such a display of data provides considerably more useful information than do presentations

    that include only means or results of tests of statisti-

    cal significance.

    Systematic Replication: Parametric Experiments

    To this point, our emphasis has been on the

    intra- and intersubject generality and reliability of

    effects, and we have argued that individual subject

    approaches can be effectively used to assess it. Gen-

    erality of effects, however, is not limited to general-

    ity across individuals, and it is to other forms of

    generality, culminating with scientific generality, to

    which we now turn.

    As noted earlier, systematic replication refers to

    replication with something changed, and, as also

    noted, a case can be made that replication with a

    new subject is a form of systematic replication in

    FIGURE 7.5. Number of days of continuous abstinence from smoking cigarettes in two groups of subjects. Circles are data from individuals. Open bars and brackets show the group means and standard errors of those means. Subjects represented by the left bar received vouchers contingent on abstinence, whereas those represented by the right bar received vouchers independent of their behavior. The top bracket and asterisk indicate that the mean difference was statistically significant at the .01 level. From "Voucher-Based Contingent Reinforcement of Smoking Abstinence Among Methadone-Maintained Patients: A Pilot Study," by K. E. Dunn, S. C. Sigmon, C. S. Thomas, S. H. Heil, and S. T. Higgins, 2008, Journal of Applied Behavior Analysis, 41, p. 533. Copyright 2008 by the Society for the Experimental Analysis of Behavior, Inc. Reprinted with permission.


    that it is an experiment with something changed,

    namely the experimental subject. From such replica-

    tions come assessments of the across-subject gener-

    ality of effects. In this section, we discuss other sorts

    of changes between experiments that constitute sys-

    tematic replication. To do so, let us begin again with

    our example of effects of food type on eating. Suppose that after obtaining the data in Figure 7.4, we

    perform a systematic replication of the study rather

    than a direct repetition. For example, we might

    notice that Food 2's carbohydrate content is higher

    than that of Food 1. We decide, therefore, to alter

    the carbohydrate content of Food 2 (assuming, although

    it is likely impossible, that this can be done without changing the

    taste) so that it matches that of Food 1, and repeat

    the experiment. Such an experiment would examine

    the generality of Food 2's effect on eating to a new

    carbohydrate level. If adjusting Food 2's carbohydrate amount to equal that of Food 1 resulted in the

    switch in foods having no effect on eating, two

    things can be concluded. One, the original result

    was not replicated. In such cases, it is often wise

    to replicate the original experiment to determine

    whether unknown variables might have been

    responsible. Two, carbohydrate amount is identified

    as a likely important variable. Thus, systematic rep-

    lication is not only a method for discovering gener-

    ality of effects, it is also an approach that can lead to

    finding controlling variables.

    Continuing our description of types of systematic

    replication, let us assume we decide to examine

    more fully the role of carbohydrates in eating. Our

    original experiment may be conducted several times

    but with a different carbohydrate mix for Food 2 on

    each occasion. Each repetition of the experiment,

    then, constitutes a systematic replication because a

    new value of carbohydrate is used for each instance.

    Experiments that systematically vary the value of a

    variable are called parametric experiments, and they

    play an especially important role in assessing generality. Consider the data in Figure 7.6, which are

    constructed to emulate what might result if several

    intersubject replications of a parametric experiment

    were conducted.

    Parametric examination provides a number of

    advantages when assessing the reliability and gener-

    ality of results. First, had only a single value of the

    independent variable been assessed, we might have been less than impressed with the degree of

    intersubject replicability of the data. The results of

    parametric examination, however, reveal a good deal of

    similarity across the three subjects: All show the

    same basic relation. At low percentages, the amount

    eaten is roughly constant within each individual.

    As the percentage increases, the amount eaten

    decreases until the percentage reaches a value above

    which further increases are associated with no

    changes in amount eaten. Second, and this is a key

    characteristic of parametric evaluation, the data suggest that only a range of levels of the independent

    variable results in a change in behavior. That is,

    parametric experiments permit the identification of

    boundary conditions, or limiting conditions, outside

    of which a variable is relatively ineffective. As we

    show later when dealing with the issue of scientific

    generality, information about boundary conditions

    can be extremely important.
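    The identification of boundary conditions can be sketched computationally. The following Python fragment is only an illustration under stated assumptions: the carbohydrate levels and amounts eaten are invented (they are not read from Figure 7.6), and `effective_range` and its `tolerance` parameter are hypothetical names introduced here.

```python
# Hypothetical data: these percentages and amounts are invented for
# illustration and are not taken from Figure 7.6.

def effective_range(levels, amounts, tolerance=1.0):
    """Return (lowest, highest) level bounding the region in which the
    amount eaten changes by more than `tolerance` between adjacent
    levels, or None if no adjacent pair differs by that much."""
    changing = [
        (levels[i], levels[i + 1])
        for i in range(len(levels) - 1)
        if abs(amounts[i + 1] - amounts[i]) > tolerance
    ]
    if not changing:
        return None
    return changing[0][0], changing[-1][1]


levels = [0, 10, 20, 30, 40, 50]       # percentage carbohydrate (assumed)
amounts = [100, 100, 80, 50, 40, 40]   # amount eaten, arbitrary units (assumed)

print(effective_range(levels, amounts))  # prints (10, 40)
```

    In this sketch, levels below 10% and above 40% lie outside the boundary conditions: varying the independent variable there leaves the invented amounts unchanged.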

    Figure 7.6 also illustrates how parametric experi-

    ments can help deal with the problem of lack of

    intersubject replicability when a single value of an independent variable is examined. Recalling our

    original example of comparison of food types, con-

    sider what could have happened if our first two sub-

    jects were Subjects 1 and 3 of Figure 7.6 and Food 1

    had contained 20% carbohydrate and Food 2 had

    contained 25%. Changing the food type would have

    produced a change for Subject 1 but not for Subject 3,

    FIGURE 7.6. Hypothetical data for three subjects showing the relationship between carbohydrate content and amount eaten.


    leading to a conclusion that we had failed to repli-

    cate the food change effect across subjects. The

    parametric examination, however, shows that both

    subjects are similar in how food intake was influ-

    enced by carbohydrate content, except that behavior

    of the two subjects was sensitive in a slightly different

    range. One of the most satisfying outcomes of parametric experiments is when they reveal

    similarities that cannot be judged when only a single value

    of an independent variable is tested.
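    The contrast between a single-value comparison and a parametric examination can be sketched in Python. This is a minimal illustration, not a reconstruction of Figure 7.6: the piecewise-linear form, the sensitive ranges assigned to each subject, and the amounts eaten are all assumptions, and `amount_eaten`, `single_value_comparison`, and `parametric_series` are hypothetical names.

```python
# Hypothetical illustration: breakpoints and amounts are invented and are
# not read from Figure 7.6.

def amount_eaten(carb_pct, start, end, high=100.0, low=40.0):
    """Flat at `high` up to `start`, declining linearly to `low` at `end`,
    and flat at `low` thereafter."""
    if carb_pct <= start:
        return high
    if carb_pct >= end:
        return low
    fraction = (carb_pct - start) / (end - start)
    return high - fraction * (high - low)


# Assumed sensitive ranges (percentage carbohydrate) for two subjects.
subjects = {
    "Subject 1": {"start": 15.0, "end": 30.0},
    "Subject 3": {"start": 25.0, "end": 40.0},
}


def single_value_comparison(food1_pct, food2_pct):
    """Change in amount eaten when switching from Food 1 to Food 2."""
    return {
        name: amount_eaten(food2_pct, **rng) - amount_eaten(food1_pct, **rng)
        for name, rng in subjects.items()
    }


def parametric_series(levels):
    """Amount eaten by each subject at every level in `levels`."""
    return {
        name: [amount_eaten(p, **rng) for p in levels]
        for name, rng in subjects.items()
    }


# Single comparison (20% vs. 25%): only Subject 1 shows a change.
print(single_value_comparison(20.0, 25.0))

# Parametric examination: both subjects show the same flat-decline-flat form.
for name, series in parametric_series(range(0, 55, 5)).items():
    print(name, [round(v) for v in series])
```

    Under these assumptions, the 20% versus 25% comparison looks like a failure to replicate across subjects, whereas the full series shows the same functional form shifted to a slightly different range.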

    It is worth noting, too, that parametric experi-

    ments can reveal that apparent intersubject replica-

    bility can be misleading regarding how a variable

    influences behavior. It is possible that tests with a

    single value of an independent variable might lead

    to very similar quantitative results for several sub-

    jects, whereas a parametric analysis reveals that very

    different functions describing the relation between the independent variable and behavior happen to cross or come

    close together at the particular value of the indepen-

    dent variable evaluated.
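    A small Python sketch may make the point about crossing functions concrete. The two linear functions below are invented solely for illustration (they do not come from any data in this chapter), and `subject_a` and `subject_b` are hypothetical names.

```python
# Hypothetical functions, invented for illustration only.

def subject_a(level):
    """Behavior measure declines as the independent variable rises."""
    return 100.0 - 2.0 * level


def subject_b(level):
    """Behavior measure rises as the independent variable rises."""
    return 20.0 + 2.0 * level


# A test at a single value suggests perfect intersubject agreement.
tested_level = 20.0
single_value_gap = abs(subject_a(tested_level) - subject_b(tested_level))
print(single_value_gap)  # prints 0.0

# A parametric series shows the agreement holds only where the lines cross.
levels = [0.0, 10.0, 20.0, 30.0, 40.0]
gaps = [abs(subject_a(x) - subject_b(x)) for x in levels]
print(gaps)  # prints [80.0, 40.0, 0.0, 40.0, 80.0]
```

    Here the two subjects agree exactly at the one tested level, yet the functions relating the variable to behavior run in opposite directions.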

    Parametric experiments illustrate one of the

    strengths of being able to characterize independent

    variables quantitatively. Experiments that determine

    how much of this yields how much of that provide

    more information about generality than do experi-

    ments that simply test whether a particular value

    of an independent variable has an effect. They can

    identify similarity where none is evident with a single value of an independent variable, and they can

    also determine whether apparent similarity is

    unrepresentative.

    We should note that parametric experiments are

    not limited in application to only primary indepen-

    dent variables, such as that shown in our fictitious

    example. Any variable associated with an experiment

    can be systematically varied. As an example, the

    experiment just described could be conducted under

    a range of temperatures, a range of degrees of

    hydration of the subjects, a range of times without food before the test, and any of several other variables.

    Those experiments, too, would provide information

    about the range of conditions under which the inde-

    pendent variable of carbohydrate content exerts its

    effects in the circumstances of the experiment.

    Parametric experiments, although very important,

    are not the only kind of systematic replications. One

    other type involves using earlier findings as a starting

    point, or baseline, for examination of other variables.

    As an example, consider the phenomenon of false

    memory in the laboratory, produced by a procedure

    originally developed by Deese (1959) and later elabo-

    rated by Roediger and McDermott (1995). In these

    studies, subjects said they recalled or recognized words that were not presented. A great deal of

    research followed the original demonstrations, and

    these experiments varied procedural details, measure-

    ment techniques, subject characteristics, and so forth.

    In each instance, therefore, in which the false memory

    effect was reproduced, the reliability of the phenome-

    non was demonstrated and its generality extended.

    Using the reproduction of previous findings as a start-

    ing point for subsequent research, therefore, is a use-

    ful and productive technique for examining reliability

    and generality of research outcomes.

    Sidman (1960), in his characterization of

    techniques of systematic replication, described a type he

    called systematic replication by affirming the conse-

    quent (p. 127). Essentially, this approach is very

    similar to the idea of hypothesis testing because the

    systematic replication is not based on simply chang-

    ing some aspect of the experiment to see whether

    effects can still be reproduced but rather on what the

    investigator sees to be the implications of previous

    results. That is, the replication may be based on the

    investigator's interpretation of what the data mean.

    For example, consider our fictitious study of the

    effects of carbohydrate content on eating. That

    result, and perhaps those of other experiments,

    might suggest that the phenomenon is not specific

    to eating. Carbohydrate ingestion possibly leads to

    general lethargy or low motivation for voluntary

    behavior. If we suspect that, we might devise other

    experiments that could be viewed as systematic rep-

    lications based on the possible implications of the

    previous findings. If the results were consistent with

    the lethargy interpretation, the view would gain in credence; if they were not, the view might well be

    abandoned. As Sidman (1960) noted, definite con-

    clusions may not be drawn from successful replica-

    tions by affirming the consequent, but, as he also

    noted, the approach is essential to science. The

    degree to which one's confidence in an

    interpretation of data grows with successful replications


    depends on many things, not the least of which is

    how counterintuitive the predicted outcome is.

    Types of Generality Assessed and Established by Systematic Replication

    Johnston and Pennypacker (2009) offered a useful

    characterization of the dimensions along which generality can be examined. They initially suggested a

    dichotomy between generality of and generality

    across. Generality across is simple to understand.

    As we have already noted, replication can be used to

    determine generality across subjects or situations, a

    type of generality usually of considerable interest.

    Systematic replication comes to the fore in the

    assessment of generality across species and across

    settings. By definition, systematic replication is an

    attempt at replication with something different, so if

    the species is changed, or if something (or a lot) about the setting is altered, the replication attempt is

    a systematic one. In both cases, the issue of what

    constitutes a successful replication may arise. Con-

    sider, for example, if we decided to attempt a cross-

    species replication of our experiments with food

    types, and our new species was a mouse. Obviously,

    mice would eat considerably less, and therefore a

    precise, quantitative replication would not be possi-

    ble. We might (actually, probably would), however,

    argue that the replication was successful if the

    relation between carbohydrate content and eating was replicated, that is, if at low concentrations there was

    little effect on eating, but as carbohydrate content

    increased, the amount eaten decreased until some

    level was reached above which further decreases were

    not seen (cf. Figure 7.6).

    What if the content values at which the decreases

    begin and end differ between the species? For exam-

    ple, mice may begin to show a decline when the

    food reaches 15% carbohydrate, whereas with the

    humans, decreases are not evident until the food

    contains 25% carbohydrate. Is that a failure to replicate? Again, the answer is yes and no. The business

    of science is to find regularities in nature, so empha-

    sis is properly placed on similarities. Differences

    virtually always exist, so they are easy to find. Nev-

    ertheless, they cannot be ignored entirely, but their

    main role is not to indicate that the similarities

    evident are somehow unimportant, but rather to

    promote further research into the origins of the dif-

    ferences if the differences are judged to be impor-

    tant. The scientist and the scientific community

    make judgments about the need for further investi-

    gation of the differences that are always present in

    replications.

    Generality of also plays an essential role in science. Johnston and Pennypacker (2009) described

    several categories of generality of, but here we focus

    on one in hopes of making the concept clear: gener-

    ality of process. Our example is a behavioral process

    familiar to most psychologists, specifically the

    process of reinforcement of operant (purposive)

    behavior. Reinforcement refers to the increase in likelihood

    of behavior as a result of earlier instances being fol-

    lowed by certain consequences, which is the pro-

    cess. Systematic replications across an immense

    range of both behavioral activities and a very large range of consequences have been shown to provide

    instances of the process. For example, in addition to

    the traditional lever press and key peck, activities

    ranging from the electrical activity of an

    imperceptible movement of the thumb (Hefferline, Keenan,

    & Harford, 1959), to vocal responses of chicks

    (Lane, 1960), to generalized imitation in children

    with developmental delays (Baer, Peterson, & Sher-

    man, 1967), to the extensive range of activities

    described in the use of reinforcement in the

    treatment of behavior disorders (e.g., Martin & Pear, 2007; Ullman & Krasner, 1966) have all been shown

    as instances of the process. Similarly, the range of

    events used as effective consequences to produce

    reinforcement is also broad. Consequences such as

    praise, food, intravenous drug administration, open-

    ing a window, reducing a loud noise, access to exer-

    cise, and many, many others have been effectively

    used to produce reinforcement. All the reports

    may be viewed as describing systematic replications

    of the earliest experiments on the process (e.g.,

    Skinner, 1932; Thorndike, 1898).

    This generality of process is what stands as the

    rationale for speaking of reinforcement theory. The

    argument is similar to that offered for the motion of

    objects. Whatever those objects are, and whether

    they are falling, floating, being ballistically pro-

    jected, or orbiting in outer space, they can be sub-

    sumed under the notion of gravitational attraction,


    Newton's theory of gravity. An even more dramatic

    example is provided by living things. All manner of

    plants and animals populate the earth, and their dif-

    ferences are obvious and virtually countless. What is

    less obvious but explains the variety is that all life

    can be considered to have developed from the

    operation of three processes: variation, selection, and retention (Darwin, 1859). The sameness of cellular

    architecture, including nuclear material (e.g., DNA

    and RNA), also attests to the similarity. Likewise, all

    the myriad instances of reinforcement suggest that

    considering them instances of a single process is rea-

    sonable. As noted earlier, an important goal of sci-

    ence is to discover uniformities. In fact, as Duhem

    (1954) noted, one of the key features of explanation

    is identification of the like in the unlike. Objects

    look different, are made of different substances, and

    may or may not be moving in a variety of ways, but they are similar in how they are affected by gravity.

    Behavioral activities take on many forms, and as just

    noted, so can the consequences of those activities.

    Nevertheless, they can (on many occasions) exhibit

    the phenomenon known theoretically as reinforce-

    ment, an instance of generality of process.

    Scientific Generality

    Another extremely important concept is scientific

    generality, a type of generality that has some

    counterintuitive characteristics. Scientific generality is important for at least two reasons. One, scientific

    generality speaks to scientists' ability to reproduce

    their own findings and those of other scientists, as

    well. Two, scientific generality speaks directly to

    the possibility of effective application and translation

    of laboratory findings to the world at large, as dis-

    cussed more fully later in the last section of this

    chapter. Scientific generality is defined by knowl-

    edgeable reproducibility. That is, it is not character-

    ized in terms of breadth of applicability, but instead

    in terms of identification of factors that are required for a phenomenon to occur. To illustrate the

    difference between scientific generality and, for example,

    generality across people, consider again the fictitious

    experiment on food types. Suppose that the original

    experiments were all performed with male subjects.

    On an attempt at replication with female subjects,

    it is discovered that food type, or carbohydrate

    composition, has no effect at all on eating. That, of

    course, would be clear indication of a limit to the

    across-subjects generality of the effect on eating. It

    would, however, represent an increase in scientific

    generality because it specifies more clearly the con-

    ditions required to produce the phenomenon of

    reduced food intake. As stated by Johnston and Pennypacker (2009), "A procedure can be quite

    valuable even though it is effective under a narrow

    range of conditions, as long as we know what those

    conditions are" (pp. 343–344). The vital role that

    systematic replication, and even failures of system-

    atic replication, can play in establishing scientific

    generality therefore becomes evident. Scientific gen-

    erality represents an understanding of the variables

    responsible for a phenomenon.

    GENERALIZATION, TECHNOLOGY TRANSFER, AND TRANSLATIONAL RESEARCH

    The function of any science is the acquisition of

    basic knowledge. A secondary benefit is often the

    possibility of applying that knowledge in ways that

    impart benefit to some element of the culture at

    large. For example, Galileo's basic astronomic

    observations eventually led to improved navigation

    procedures with attendant benefits to the colonial powers

    of 17th-century Europe. Pasteur's discovery in 1863 of the microorganisms that sour wine and turn it

    into vinegar, and the observation that heat would

    kill them, led eventually to the germ theory of dis-

    ease and the development of vaccines.

    In the case of behavior analysis, a relatively

    young science, sufficient basic knowledge has been

    acquired to permit vigorous attempts at application.

    A discipline known as applied behavior analysis,

    discussed extensively elsewhere in Volume 2 of this

    handbook, is the primary result of these efforts,

    although applications of the findings of behavior analysis are to be found in a variety of other

    disciplines, including medicine, education, and

    management, to name but a few.

    In this section, we describe issues surrounding

    attempts to apply laboratory research findings in the

    wider world at large. Specifically, we discuss topics

    related to applying research findings from controlled


    laboratory or therapeutic settings to new situations

    or less controlled environments. First, we describe

    the issue of generalization of behavioral treatment

    effects from treatment settings to real-world circum-

    stances. Then we outline basic general strategies

    for effective transfer of technologies, taking into

    account the known scientific generality of behavioral processes. Finally, we offer comments on the

    notion of translational research, a matter of much

    contemporary interest.

    Generalization of Applications

    One of the earliest subjects of discussion that arose

    with the development of behavior therapy and

    behavior modification techniques was the issue

    referred to as generalization (e.g., Yates, 1970).

    Specifically, there was concern about whether

    improvements produced in a therapy setting would also appear in other, nontherapy (e.g., everyday life)

    situations. The term generalization was borrowed from

    a core behavioral process discovered by experimen-

    tal psychologists, that after learning to respond in a

    particular way in the presence of a particular stimu-

    lus, say frequency of a tone, the same behavior may

    also occur in the presence of other more or less sim-

    ilar stimuli, say, other frequencies of the tone. It is

    an apparently simple logical step to suggest that

    behavior learned in a therapy environment might

    also appear in nontherapy, real-world environments, and when it does so, the result can be called

    generalization (but see Johnston, 1979, for problems with

    such a simple extrapolation). Because applied

    behavior analysis generally involves establishing

    conditions that alter behavior, the issue of whether

    those changes are restricted to the learning situa-

    tions arranged or whether they become evident in

    other situations is usually important. For example,

    if a child who engages in aggressive behavior is

    exposed to a treatment program to reduce

    aggression, a goal would be to decrease aggression not only in the treatment setting but in all settings.

    In a seminal article, Stokes and Baer (1977) dis-

    cussed the issue of generalization of treatment

    effects. A key contribution of their article was to

    indicate that in general, if effects of a treatment are

    to be manifested in a variety of circumstances,

    achieving that outcome must be considered in

    designing the intervention intended to effect the

    change in behavior. That is, it is not always suffi-

    cient to simply arrange circumstances that produce

    a desired change in behavior in the circumscribed

    environment in which the treatment is undertaken.

    Instead, procedures should be used that increase the

    probability that the change will be enduring and manifested in those parts of a client's environment

    in which the changes are useful. That insight has

    been followed by the development of general strate-

    gies to enhance the likelihood that behavior changes

    occur not only in the treatment environment but

    also in other appropriate ones.

    For example, Miltenberger (2008) described

    several general strategies that can be used to pro-

    mote generalization of treatment effects. The most

    direct strategy is to arrange for rewards to occur

    immediately after instances of generalization occur. Such an approach essentially entails taking

    treatment to the environments in which it is hoped the

    new behavioral patterns will occur. That is, the

    training environment is not explicitly delimited.

    Such an approach is now widespread in applied

    behavior analysis partly as a consequence of an

    emphasis on analyzing reinforcement functions

    before implementing treatment (see Iwata, Dorsey,

    Slifer, Bauman, & Richman, 1982). This approach

    to problem behavior entails discovering whether

    the behavior is maintained by reinforcement, and if it is, identifying what the reinforcers are in the

    environments in which the problem behavior

    occurs. Once the reinforcers responsible for the

    maintenance of the problem behavior are identi-

    fied, then procedures based on that knowledge are

    implemented in the situations in which the behav-

    ior occurs.

    A related second strategy identified by Milten-

    berger (2008) is consideration of the conditions

    operating in the environments in which the changed

    behavior would be occurring. The idea here is that behavior that is changed in the therapeutic setting,

    for example learning effective social skills, will lead,

    if performed, to more satisfying social interactions

    in the nontherapy environment, and those successes

    will help to solidify the gains made in the therapy

    sessions. In designing the therapeutic goals, there-

    fore, consideration is given to what sorts of behavior


    are most likely to be successful in the nontherapy

    environment.

    A less obvious strategy applies when the nonther-

    apy environment appears to offer little or no support

    for the changed behavior. An example is when ther-

    apy is aimed at training an adolescent to walk away

    from aggressive provocation in a schoolyard. Behaving in such an aggression-thwarting manner is not

    likely to result in positive outcomes with peers, who

    are instead likely to provide taunts and jeers after

    such actions. In such a case, it may be prudent to try

    to change the normal consequences in the school-

    yard by having teachers or other monitors provide

    positive consequences (perhaps in the form of privi-

    leges, praise, etc.) for such actions when they occur.

    That is, the strategy here involves altering the con-

    tingencies operating in the nontherapy environment.

    A fourth general strategy is to try to make the therapy setting more like the nontherapy

    environment in which the changed behavior is to occur. A

    study by Poche, Brouwer, and Swearingen (1981)

    illustrated this approach. They taught abduction

    prevention skills to preschool children, but in so

    doing incorporated a relatively large number of

    abduction lures in the training. The intent was that

    by including a wide variety of possible lures that

    might be used by would-be kidnappers, the training

    would be more effective in real-world situations

    than if it had not involved those variations. The general strategy in this case was to train with as many

    likely relevant situations as possible. Another way to

    view this strategy is that it involves incorporating

    stimuli that are present in the nontherapy environ-

    ment into the training.

    A fifth approach is somewhat less intuitive, but

    research has suggested that it may be effective. The

    core idea is that if a variety of different forms of

    effective behavior are established by the therapy or

    training, the chance of effective behavior occurring

    in the nontherapy environment is better, and as a result the successful behavior will be supported and

    continue to occur. As a simple illustration, Milten-

    berger (2008) offered the example of teaching a shy

    person a variety of specific ways to ask for a date,

    which provides the person with several actions to

    try, some of which are likely to be successful outside

    of therapy.

    In this section, we focused on particular strate-

    gies for ensuring that desired changes in behavior

    established through therapeutic methods occur and

    persist in nontraining or nontherapy environments,

    that is, in the everyday world. Employment of tactics

    emerging from the strategies described has yielded

    many successes, and the methods are part of the armamentarium of applied behavior analysts. These

    techniques to promote generalization of behavior

    changes have emerged from a consideration of

    fundamental behavioral processes that have been

    identified and analyzed in basic research and then

    subsequently validated as effective through applied

    research. They represent, consequently, what can be

    called successful transfer from basic science to effec-

    tive technology, namely, an instance of what has

    come to be called technology transfer. In the next

    section, we discuss some general principles of effective technology transfer.

    Technology Transfer

    People often use the term technology to refer to the

    body o