On Test Suite Composition and Cost-Effective Regression Testing∗
Gregg Rothermel†, Sebastian Elbaum‡, Alexey Malishevsky†, Praveen Kallakuri‡, Xuemei Qiu†
†School of Electrical Engineering and Computer Science
Oregon State University, Corvallis, Oregon
{grother, malishal, qiuxu}@cs.orst.edu
‡Department of Computer Science and Engineering
University of Nebraska - Lincoln, Lincoln, Nebraska
{elbaum, pkallaku}@cse.unl.edu
August 30, 2003
Abstract
Regression testing is an expensive testing process used to re-validate software as it evolves. Various methodologies for improving regression testing processes have been explored, but the cost-effectiveness of these methodologies has been shown to vary with characteristics of regression test suites. One such characteristic involves the way in which test inputs are composed into test cases within a test suite. This article reports the results of controlled experiments examining the effects of two factors in test suite composition — test suite granularity and test input grouping — on the costs and benefits of several regression-testing-related methodologies: retest-all, regression test selection, test suite reduction, and test case prioritization. These experiments consider the application of several specific techniques, from each of these methodologies, across ten releases each of two substantial software systems, using seven levels of test suite granularity and two types of test input grouping. The effects of granularity, technique, and grouping on the cost and fault-detection effectiveness of regression testing under the given methodologies are analyzed. This analysis shows that test suite granularity significantly affects several cost-benefits factors for the methodologies considered, while test input grouping has limited effects. Further, the results expose essential tradeoffs affecting the relationship between test suite design and regression testing cost-effectiveness, with several implications for practice.
1 Introduction
As software evolves, test engineers regression test it to validate new features and detect whether new faults
have been introduced into previously tested code. Regression testing is important, but also expensive, so
many methodologies for improving its cost-effectiveness have been investigated. Among these methodologies
are four that involve reuse of existing test cases. The retest-all methodology [26, 30] re-uses all previously
developed test cases, executing them on the modified program. Regression test selection (e.g., [7, 36]) re-
uses test cases too, but selectively, focusing on subsets of existing test suites. Test case prioritization (e.g.,
[11, 39, 40, 44]) orders test cases so that those that are better at achieving testing objectives are run earlier
in the regression testing cycle. Finally, test suite reduction (e.g., [6, 17, 29]) attempts to reduce future
regression testing costs by permanently eliminating test cases from test suites.

∗Portions of this research have been previously presented in [33].
The cost-effectiveness of specific techniques under these four methodologies varies with characteristics
of test suites [11, 37, 38]. One prominent factor in this variance involves the way in which test inputs are
composed into test cases within a test suite. For example:
• A test suite for a word processor might contain just a few test cases that start up the system, open a
document, issue hundreds of editing commands, and close the document, or it might contain hundreds
of test cases that each issue only a few commands.
• A test suite for a compiler might contain several test cases that each compile a source file containing
hundreds of language constructs, or hundreds of test cases that each compile source files containing
just a few constructs.
• A test suite for a class library might contain a few test drivers that each invoke dozens of methods, or
dozens of drivers that each invoke just a few methods.
These examples expose important choices in test suite design, and faced with such choices, test engineers
may wonder how best to proceed. Textbooks and articles on testing provide varying and sometimes contra-
dictory advice. Beizer [2, p. 51], for example, asserts: “It’s better to use several simple, obvious tests than
to do the job with fewer, but grander, tests.” Kaner et al. [23, p. 125] suggest that large test cases can save
time, provided they are not overly complicated, in which case simpler test cases may be more efficient. Kit
[25, p. 107] suggests that when testing valid inputs for which failures should be infrequent, large test cases
are preferable. Hildebrandt [20] argues that small test cases facilitate debugging. Bach [1] states that small
test cases cause fewer difficulties with cascading errors, but large test cases are better at exposing system
level failures involving interactions between software components.
Most of the foregoing statements refer to test case size, but the issues concerned are more complex. In
this article, we consider two specific characteristics of test suite composition: test suite granularity and test
input grouping. These characteristics pertain to the way in which test engineers group individual test inputs
into test cases within test suites. Test suite granularity pertains to the size of the test cases so grouped –
the number of inputs, or amount of input applied, per test case. Test input grouping pertains to the content
of test cases – the degree of hetero- or homogeneity among the inputs that compose a test case. (We define
these characteristics more precisely in Section 2, and provide precise measures for them in Section 3.2.1).
Despite the apparent importance of test suite composition and the apparent contradictions among state-
ments in the popular testing literature, in our search of the research literature we find little formal examina-
tion of the cost-benefits tradeoffs associated with test suite granularity and test input grouping. A thorough
investigation of these tradeoffs and the implications they hold for testing across the software lifecycle could
help test engineers design test suites that better support cost-effective regression testing.
We have therefore designed and performed a family of controlled experiments, examining the effects
of test suite granularity and test input grouping on the costs and benefits of the four regression-testing-
related methodologies mentioned above: retest-all, regression test selection, test suite reduction, and test
case prioritization. Our experiments consider the application of several techniques, under each of these
methodologies, across ten releases each of two substantial software systems, using seven different levels of
test suite granularity and two different types of test input grouping. We measure and analyze the effects of
granularity, technique, and grouping on the costs of regression testing the systems as they evolve, and on
the fault-detection effectiveness of that regression testing.
Our results show that test suite granularity significantly affects several cost-benefits factors for the
methodologies considered, while test input grouping has limited effects. Further, our results expose several
essential tradeoffs affecting the relationship between test suite design and regression testing cost-effectiveness,
with several implications for practice.
In the following section we review the issues and the previous literature related to this work. Section 3
presents our experiment design, results, and analysis. Section 4 discusses the implications of our results, and
Section 5 summarizes and comments on future work.
2 Background and Related Work
One could certainly study the effects of test suite composition on the cost-effectiveness of test suites, focusing
on the testing of initial versions of new software systems. Such a study could provide data on the cost-
effectiveness of various types of test development strategies relative to initial system releases, a context that
is certainly important.
In our view, however, such a study would overlook a central facet of software system development.
Successful software systems are seldom developed and tested just once; rather, they evolve, and are re-tested
repeatedly across their lifetimes. A testing methodology that is effective for an initial system release, but that
complicates subsequent regression testing of the system as it evolves, may be less cost-effective overall than
a methodology that is initially expensive but amortizes initial testing costs over subsequent, cost-effective,
regression testing runs.
A fundamental thesis behind this work, therefore, is that testing cost and effectiveness are best assessed
relative to systems across their lifecycles. This means, among other things, that we must assess testing
techniques and test design choices relative to their effects on regression testing.
For this reason, in this work, we study the effects of test suite granularity and test input grouping on
testing activities in relation to regression testing.
In the following subsections, we provide more detailed discussion of test suite granularity and test input
grouping, we describe the particular regression testing activities on which we focus, and we discuss related
work on these topics.
2.1 Test Suite Granularity and Test Input Grouping
Following Binder [4], we define a test case to consist of a pretest state of the system under test (including
its environment), a sequence of test inputs, and a statement of expected test results. We define a test suite
to be a set of test cases.
Definitions of test suite granularity and test input grouping are harder to come by, but the testing problem
we are addressing is a practical one, so we begin by drawing on examples.
Test engineers designing test cases for a system identify various testing requirements for that system, such
as specification items, code elements, or method sequences. Next, they must construct test cases that exercise
these requirements. An engineer testing a word processor might specify sequences of editing commands, an
engineer testing a compiler might create sample target-language programs, and an engineer testing a class
library might develop drivers that invoke methods. The practical questions these engineers face include how
many and which editing commands to include per sequence, how many and which constructs to include in
each target-language program, and how many and which methods to invoke per driver, respectively.
We wish to answer these questions, and the answers are likely to involve many factors. For example, if
the cost of performing setup activities for individual test cases dominates the cost of executing those test
cases, a test suite containing a few large test cases can be less expensive than a suite containing many small
test cases. Large test cases might also be better than small ones at exposing failures caused by interactions
among system functions. Small test cases, on the other hand, can be easier to use in debugging than large test
cases, because they reduce occurrences of cascading errors [1] and simplify fault localization [20]. Further, in
test cases composed of large numbers of test inputs, inputs occurring early in the test cases may prevent test
inputs that appear later in those test cases from exercising the requirements they are intended to exercise,
by causing subsequent test inputs to be applied from system states that differ from those intended.
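The setup-cost tradeoff described above can be illustrated with a toy cost model. The function and numbers below are hypothetical, not drawn from the paper's experiments: grouping inputs into larger test cases simply amortizes a fixed per-test-case setup cost over more inputs.

```python
import math

# Toy cost model (hypothetical): n_inputs test inputs are grouped into test
# cases of size k, and each test case pays a fixed setup cost plus a cost
# per input applied.
def suite_cost(n_inputs, k, setup_cost, cost_per_input):
    n_cases = math.ceil(n_inputs / k)
    return n_cases * setup_cost + n_inputs * cost_per_input

# 1000 inputs, setup = 5 time units per test case, 1 unit per input:
print(suite_cost(1000, 1, 5, 1))    # 1000 test cases -> 6000 units
print(suite_cost(1000, 100, 5, 1))  # 10 test cases   -> 1050 units
```

Under this simple model the coarse-granularity suite is far cheaper to execute; the debugging and input-masking concerns discussed above pull in the opposite direction.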
In part, the foregoing examples involve test case size, a term used informally in [1, 2, 23, 25] to denote
notions such as the number of commands applied to, or the amount of input processed by, the program under
test, for a given test case. However, there is more than just test case size involved: when engineers increase
or decrease the number of requirements covered by each test case, this directly determines the number of
individual test cases that must be created to cover all the requirements. Thus, as expressed by Beizer [2],
the choice is not just between “large” and “small” tests, but between “several simple, obvious tests” and
“fewer, but grander, tests”.
The interaction of test case size and number of test cases is one plausible factor underlying the cost-
benefits tradeoffs described above. One phenomenon we wish to study in this article, then, involves the
effects that occur when test inputs are composed into test cases of specific sizes in a test suite. We use the term
test suite granularity to describe a partition on a set of test inputs into a test suite containing test cases of
a given size. Section 3.2.1 presents a precise metric for this construct.
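As a rough sketch of this construct, a granularity level g can be viewed as partitioning an ordered pool of test inputs into test cases of g consecutive inputs each. The names below are illustrative only; the paper's actual metric is the one presented in Section 3.2.1.

```python
# Partition an ordered pool of test inputs into test cases of size g;
# the final test case may be smaller if g does not divide the pool evenly.
def compose_suite(test_inputs, g):
    return [test_inputs[i:i + g] for i in range(0, len(test_inputs), g)]

inputs = [f"input-{i}" for i in range(8)]
print(len(compose_suite(inputs, 1)))  # 8 test cases of one input each
print(len(compose_suite(inputs, 4)))  # 2 test cases of four inputs each
```

Varying g varies test suite granularity while holding the underlying input pool fixed, which is what allows granularity to be manipulated as an independent variable.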
An additional factor that may influence the effects of choices in test suite design, however, involves the
relationship between the particular test inputs that are assembled into individual test cases. For example, a
typical approach in test development and automation is for test engineers to group together, into individual
test cases, test inputs that address similar functionality (for example, inputs related to a specific use case or
set of related functional requirements). This can be distinguished from approaches that group test inputs in
other ways, such as by engineer or team. We use the term test input grouping to describe this factor. Section
3.2.1 provides a precise metric for this construct.
As thus defined, test suite granularity concerns the sizes of individual test cases, but not their content,
and test input grouping concerns the content of individual test cases, but not their size. Together these
two terms represent test suite composition, but as we shall show, the two factors can be varied separately,
allowing us to examine both their individual and combined roles in affecting the cost and effectiveness of
regression testing methodologies.
Other definitions of test case, test case size, test suite granularity, and test input grouping than those
used in this work could also be of practical interest. Test engineers might choose to view the individual
inputs applied during a single invocation of a word processor, or the individual method invocations made
from within a class driver, as individual test cases, each with its own size. Also, in practice, test suites may
contain test cases of varying sizes and with varying logic underlying groupings. As we show in Section 3,
however, our definitions facilitate the controlled study of the cost-benefits tradeoffs outlined above, allowing
us to investigate questions of causality not otherwise amenable to study.
2.2 Regression Testing and Regression-Testing-Related Methodologies
Let P be a program, let P ′ be a modified version of P , and let T be a test suite developed for P . Regression
testing is concerned with validating P ′.
To facilitate regression testing, engineers typically re-use T , but new test cases may also be required to
test new functionality. Both reuse of T and creation of new test cases are important; however, it is test case
reuse that concerns us here, as it is the desire to re-use test cases that motivates most suggestions about
costs and benefits of test suite granularity. In particular, we consider four methodologies related to regression
testing and test reuse: retest-all, regression test selection, test suite reduction, and test case prioritization.1
2.2.1 Retest-all
When P is modified, creating P ′, test engineers may simply reuse all non-obsolete test cases in T to test P ′;
this is known as the retest-all technique [26]. (Test cases in T that no longer apply to P ′ are obsolete, and
must be reformulated or discarded [26].) The retest-all technique represents typical current practice [30],
and thus, serves as our control technique.
2.2.2 Regression Test Selection
The retest-all technique can be expensive: rerunning all test cases may require an unacceptable amount of
time or human effort. Regression test selection (RTS) techniques (e.g., [5, 7, 14, 27, 36, 41]) use information
about P , P ′, and T to select a subset of T with which to test P ′. (For a survey of RTS techniques, see [35].)
Empirical studies of some of these techniques [7, 15, 34, 37] have shown that they can be cost-effective.
One cost-benefits tradeoff among RTS techniques involves safety and efficiency. Safe RTS techniques
(e.g. [7, 36, 41]) guarantee that, under certain conditions, test cases not selected could not have exposed
faults in P ′ [35]. Achieving safety, however, may require inclusion of a larger number of test cases than can
be run in available testing time. Non-safe RTS techniques (e.g. [14, 16, 27]) sacrifice safety for efficiency,
selecting test cases that, in some sense, are more useful than those excluded. A special case among non-safe
techniques involves techniques that attempt to minimize the selected test suite relative to a fixed set of coverage requirements and information on changes (e.g. [14]), seeking the lowest test execution cost possible consistent with covering changed sections of code.

1There are also several other sub-problems related to the regression testing effort, including the problems of automating testing activities, managing testing-related artifacts, identifying obsolete tests, and providing test oracle support [19, 26, 30]. We do not directly address these problems here, although our results could have implications worth considering for them.
2.2.3 Test Suite Reduction
As P evolves, new test cases may be added to T to validate new functionality. Over time, T grows, and its
test cases may become redundant in terms of code or functionality exercised. Test suite reduction techniques2
[6, 17, 22, 29] address this problem by using information about P and T to permanently remove redundant
test cases from T , rendering later reuse of T more efficient. Test suite reduction thus differs from regression
test selection in that the latter does not permanently remove test cases from T , but simply “screens” those
test cases for use on a specific version P ′ of P , retaining unused test cases for use on future releases. Test
suite reduction analyses are also typically accomplished (unlike regression test selection) independent of P ′.
By reducing test-suite size, test-suite reduction techniques reduce the costs of executing, validating, and
managing test suites over future releases of the software. A potential drawback of test-suite reduction, how-
ever, is that removal of test cases from a test suite may damage that test suite’s fault-detecting capabilities.
Some studies [43] have shown that test-suite reduction can produce substantial savings at little cost to fault-
detection effectiveness. Other studies [38] have shown that test suite reduction can significantly reduce the
fault-detection effectiveness of test suites.
2.2.4 Test Case Prioritization
Test case prioritization techniques [11, 22, 39, 40, 44], schedule test cases so that those with the highest
priority, according to some criterion, are executed earlier in the regression testing process than lower priority
test cases. For example, testers might wish to schedule test cases in an order that achieves code coverage at
the fastest rate possible, exercises features in order of expected frequency of use, or increases the likelihood
of detecting faults early in testing.
Empirical results [11, 39, 44] suggest that several simple prioritization techniques can significantly im-
prove one testing performance goal; namely, the rate at which test suites detect faults. An improved rate of
fault detection during regression testing provides earlier feedback on the system under test and lets software
engineers begin addressing faults earlier than might otherwise be possible. These results also suggest, how-
ever, that the relative cost-effectiveness of prioritization techniques varies across workloads (programs, test
suites, and types of modifications).
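One widely studied family of such techniques orders test cases greedily by "additional" coverage. The sketch below is a simplified illustration over hypothetical coverage data, not the exact algorithm evaluated in this article: repeatedly pick the test case that covers the most not-yet-covered functions, resetting the covered set once no remaining test adds new coverage.

```python
# Greedy "additional coverage" prioritization sketch (hypothetical data).
# coverage maps each test case to the set of functions it exercises.
def additional_coverage_order(coverage):
    remaining = dict(coverage)
    order, covered = [], set()
    while remaining:
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        if not (remaining[best] - covered) and covered:
            covered = set()  # no remaining test adds coverage: reset and re-rank
            continue
        order.append(best)
        covered |= remaining.pop(best)
    return order

cov = {"t1": {"f1", "f2"}, "t2": {"f2"}, "t3": {"f3", "f4", "f5"}}
print(additional_coverage_order(cov))  # ['t3', 't1', 't2']
```

The intuition is that covering code as early as possible tends to raise the rate at which a suite's faults are detected, which is the performance goal measured in the studies cited above.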
Many different prioritization techniques have been proposed [10, 11, 39, 40, 44], but the techniques most
prevalent in literature and practice involve those that utilize simple code coverage information, and those
that supplement coverage information with details on where code has been modified. The latter approach
has been found efficient on extremely large systems at Microsoft [40], but the relative effectiveness of the
approaches has been shown to vary with several factors including characteristics of the test suite utilized
[13], further motivating experiments such as those reported in this article.

2Test suite reduction has also been referred to, in the literature, as test suite minimization; however, the intractability of the test suite minimization problem forces techniques to employ heuristics that may not yield minimum test suites; thus, we term these techniques "reduction" techniques.
2.3 Related Work
Many articles [7, 9, 15, 24, 38, 43] have examined the costs and benefits of retest-all, regression test selection,
test case prioritization, and test case reduction techniques. Several textbooks and articles on testing [1, 2,
9, 20, 23, 25, 38] have discussed tradeoffs involving test suite granularity. None of this literature, however,
describes any formal or empirical examinations of these tradeoffs.
In [34, 37], test suite granularity is specifically treated as a factor in two studies of regression test
selection, and test suites constructed from smaller test cases are shown to facilitate selection. These studies,
however, measured only numbers of test cases selected, considered only safe RTS techniques, and omitted
consideration of test input grouping. In contrast, this article presents the results of controlled experiments
designed specifically to examine the impact of test suite granularity and test input grouping on the costs
and savings associated with several regression testing methodologies and techniques, across several metrics
of importance.
In [33], we presented the results of an initial set of controlled experiments examining the effects of test
suite granularity on the retest-all, regression test selection, test suite reduction, and test case prioritization
methodologies. The experiments reported in this article extend those experiments in the following ways:
• The experiments in [33] treated test suite granularity, program, and technique as independent variables;
these experiments expand the set of independent variables considered to include test input grouping.
• The experiments in [33] utilized six versions each of two subject software systems; these experiments
expand the subject pool to ten versions of each of these systems.
• The experiments in [33] utilized four levels of test suite granularity; these experiments expand this to
seven levels.
• These experiments examine an additional regression test selection technique and an additional test
case prioritization technique, each representing important classes of techniques not considered in [33].
• These experiments utilize improved test oracles, providing a new view on fault detection results.
• The analysis of the results of the experiments in [33] considered only main effects; the analysis of the
results of these experiments also considers significant interactions.
• The discussion of the results obtained in these experiments utilizes an additional measure of fault-
detection effectiveness not considered in [33].
• The discussion of results considers not only general tendencies, but also the particular findings and
impact of those findings within each methodology.
The net effect of these changes is an expansion of the external, construct, and conclusion validity of the
results reported in [33], and a more thorough understanding of the effects of test suite composition than was
achievable through the earlier experiments alone.
3 Experiments
Informally, our goal is to address the research question: “how do test suite granularity and test input grouping
affect the costs and benefits of regression testing methodologies?” More formally, we seek to evaluate the
following hypotheses (expressed as null hypotheses) for four methodologies — retest all, regression test
selection, test suite reduction, and test case prioritization — at a 0.05 level of significance:
H1 (test suite granularity): Test suite granularity does not have a significant impact on the
costs and benefits of regression testing techniques.
H2 (test input grouping): Test input grouping does not have a significant impact on the costs
and benefits of regression testing techniques.
H3 (technique): Regression testing techniques do not perform significantly differently in terms
of the selected costs and benefits measures.3
H4 (interactions): Test suite granularity and test input grouping effects across regression test-
ing techniques and programs do not significantly differ.
To test these hypotheses we designed several controlled experiments. The following subsections present,
for these experiments, our objects of analysis, independent variables, dependent variables and measures,
experiment setup and design, threats to validity, and data and analysis. Further discussion of the results
and their implications follows in Section 4.
3.1 Objects of Analysis: emp-server and bash
As objects of analysis we utilized ten releases each of two substantial C programs: emp-server and bash.
Emp-server is the server component of the open-source client-server internet game Empire. Emp-server
is essentially a transaction manager: its main routine consists of initialization code followed by an event
loop in which execution waits for receipt of a user command. Emp-server is invoked and left running on a
host system; a user communicates with the server by executing a client that transmits the user’s inputs
to it as commands. When emp-server receives a command, its event loop invokes routines that process the
command, then waits to receive the next command. As emp-server processes commands, it may return
data to the client program for display on the user’s terminal, or write data to a local database (a directory of
ASCII and binary files) that keeps track of the game’s state. The event loop and program terminate when a
user issues a “quit” command. Table 1 shows the numbers of functions and lines of executable code in each
of the ten versions of emp-server that we considered, and for each version after the first, the number of
functions changed for that version (modified or added to the version, or deleted from the preceding version).
Bash [32], short for “Bourne Again SHell”, is a popular open-source application that provides a command
line interface to multiple Unix services. Bash was developed as part of the GNU Project, adopting several
features from the Korn and C shells, but also incorporating new functionality such as improved command line
editing, unlimited size command history, job control, indexed arrays of unlimited size, and more advanced integer arithmetic. Bash is still evolving; on average two new releases have emerged per year over the last five years. The ten versions of bash that we used were released from 1996 to 2001 (see Table 1). Each release corrects faults, but also provides new functionality, as evidenced by the increasing code size.

| Program    | Version | Functions | Changed Functions | Lines of Code |
|------------|---------|-----------|-------------------|---------------|
| emp-server | 4.2.0   | 1,188     | —                 | 63,014        |
| emp-server | 4.2.1   | 1,188     | 51                | 63,014        |
| emp-server | 4.2.2   | 1,197     | 245               | 63,658        |
| emp-server | 4.2.3   | 1,196     | 157               | 63,937        |
| emp-server | 4.2.4   | 1,197     | 9                 | 63,988        |
| emp-server | 4.2.5   | 1,197     | 101               | 64,063        |
| emp-server | 4.2.6   | 1,197     | 32                | 64,108        |
| emp-server | 4.2.7   | 1,197     | 156               | 64,439        |
| emp-server | 4.2.8   | 1,189     | 52                | 64,381        |
| emp-server | 4.2.9   | 1,189     | 12                | 64,396        |
| bash       | 2.0     | 1,494     | —                 | 48,292        |
| bash       | 2.01    | 1,537     | 238               | 49,555        |
| bash       | 2.01.1  | 1,538     | 40                | 49,666        |
| bash       | 2.02    | 1,678     | 197               | 58,090        |
| bash       | 2.02.1  | 1,678     | 12                | 58,103        |
| bash       | 2.03    | 1,703     | 152               | 59,010        |
| bash       | 2.04    | 1,890     | 267               | 63,648        |
| bash       | 2.05a   | 1,942     | 411               | 65,319        |
| bash       | 2.05b   | 1,949     | 34                | 65,433        |
| bash       | 2.05    | 1,950     | 20                | 65,474        |

Table 1: Experiment Subjects

3This hypothesis has been tested in previous studies, and is included primarily for completeness and replication.
3.2 Variables and Measures
3.2.1 Independent Variables
Our experiments manipulated three independent variables: regression testing technique, test suite granular-
ity, and test input grouping.
Regression Testing Technique
For each regression testing methodology considered other than retest-all, we studied several techniques. In
selecting techniques we had three goals: (1) to include techniques that could serve as practical experimental
controls, (2) to include techniques that could easily be implemented by practitioners, and (3) to include
techniques that exemplify the primary categories of available techniques (and in so doing, reflect the primary
potential tradeoffs among techniques).
Retest-all. There is just one retest-all technique: run all of the non-obsolete test cases in T on P ′. We
investigate the effects of test suite granularity and test input grouping on this technique. (The retest-
all technique also serves as a control technique in our evaluations of RTS and test suite reduction
methodologies, as it represents standard practice when those methodologies are not employed.)
Regression test selection. We selected four RTS techniques, retest-all, modified entity, modified non-core
entity, and minimization:
• In this context retest-all is our control technique, representing the typical current practice of
selecting all non-obsolete test cases for re-execution.
• The modified entity technique [7] is a safe RTS technique: it selects test cases that exercise
functions, in P , that (1) have been deleted or changed in producing P ′, or (2) use variables or
structures that have been deleted or changed in producing P ′.
• The modified non-core entity technique [33] acts like the modified-entity technique, but ignores
“core” functions, defined as functions exercised by more than k% of the test cases in the test
suite. Following results of previous studies of technique effectiveness [3, 34], we set k to 80%.
This technique trades safety for savings in re-testing effort (selecting all test cases through core
functions may lead to selecting all of T ).
• The minimization technique [14] attempts to select a minimal set of test cases, from T , that
yields coverage of modified functions in P ′. This is necessarily an heuristic, as the technique uses
coverage information gathered from applying T to P to attempt to predict the functions that will
be covered in P ′.
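As an illustration only (the function names and data layout here are ours, not drawn from [7] or [33], and the published techniques also account for deleted entities and variable/structure uses), the two coverage-based selection rules can be sketched as follows, given per-test function coverage collected on P:

```python
def select_modified_entity(coverage, changed):
    """Safe selection: keep every test case exercising a changed function.

    coverage: dict mapping test case id -> set of functions it exercised on P.
    changed: set of functions deleted or changed in producing P'.
    """
    return {t for t, funcs in coverage.items() if funcs & changed}

def select_modified_non_core(coverage, changed, k=0.80):
    """Like modified entity, but ignore 'core' functions exercised by more
    than k of the test cases (k = 80% in these experiments)."""
    n = len(coverage)
    all_funcs = set().union(*coverage.values())
    core = {f for f in all_funcs
            if sum(f in funcs for funcs in coverage.values()) > k * n}
    return {t for t, funcs in coverage.items() if (funcs - core) & changed}
```

The second function shows the safety tradeoff directly: a test case whose only link to the changes is through a core function is dropped, trading possible fault detection for re-testing effort.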
Test suite reduction. We selected two test suite reduction techniques, no reduction and GHS reduction.
• The no reduction technique, equivalent to retest-all, represents current typical practice and serves
as our control.
• The GHS reduction technique is an heuristic presented by Gupta, Harrold, and Soffa [17] that
attempts to produce suites that are minimal for a given coverage criterion; we used a function
coverage criterion.
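The heuristic of [17] is more involved (it examines test cases requirement by requirement, in order of how few test cases satisfy each requirement); a simpler greedy stand-in that conveys the same goal, covering every function with as few test cases as possible, might look like this sketch:

```python
def greedy_reduce(coverage):
    """Greedy approximation of a minimal function-coverage-adequate suite.

    coverage: dict mapping test case id -> set of functions it covers.
    Returns a list of test cases that together cover every coverable function.
    """
    uncovered = set().union(*coverage.values())
    reduced = []
    while uncovered:
        # pick the test case covering the most still-uncovered functions
        best = max(coverage, key=lambda t: len(coverage[t] & uncovered))
        reduced.append(best)
        uncovered -= coverage[best]
    return reduced
```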
Test case prioritization. We selected three test case prioritization techniques: additional function cov-
erage, additional modified-function coverage, and optimal prioritization. These are described in detail
in [39], we summarize them here.
• Additional function coverage prioritization iteratively selects a test case that yields the greatest
function coverage, then adjusts the coverage information on subsequent test cases to indicate their
coverage of functions not yet covered, and then repeats this process until all functions covered by
at least one test case have been covered. The process then iterates on the remaining test cases.
• Additional modified-function coverage prioritization acts like additional function coverage prior-
itization, except that it initially attends only to functions that have been modified; after all test
cases executing one or more modified functions have been placed in the order, additional function
coverage prioritization is applied to the remaining test cases.
• Optimal prioritization uses information on which test cases in T reveal faults in P ′ to find an
approximate optimal ordering for T . Though not a practical technique (in practice we do not
know which test cases reveal which faults beforehand), this technique provides an upper bound
on prioritization benefits.
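The iterative process behind the first two techniques might be sketched as follows (an illustrative implementation with our own naming, not the authors' tool; the modified-function variant would simply run this first with coverage restricted to modified functions):

```python
def additional_coverage_prioritize(coverage):
    """Order test cases by 'additional' function coverage.

    coverage: dict mapping test case id -> set of functions it covers.
    Repeatedly picks the test case covering the most not-yet-covered
    functions; once every coverable function is covered, coverage
    information is reset and the process iterates on the remainder.
    """
    remaining = dict(coverage)
    all_funcs = set().union(*coverage.values())
    uncovered = set(all_funcs)
    order = []
    while remaining:
        best = max(remaining, key=lambda t: len(remaining[t] & uncovered))
        if not remaining[best] & uncovered:
            # no remaining test adds coverage: reset and iterate on the rest
            uncovered = set(all_funcs)
            best = max(remaining, key=lambda t: len(remaining[t] & uncovered))
        order.append(best)
        uncovered -= remaining.pop(best)
    return order
```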
Test Suite Granularity
To investigate the impact of test suite granularity on the costs and benefits of regression testing techniques,
we needed to obtain test suites of varying granularities, while controlling for other factors that might affect
our dependent measures.
We considered two approaches for doing this. The first approach is to obtain or construct test suites
for a program, partition them into subsets according to size, and compare the results of executing these
different subsets. A drawback of this approach, however, is that it will not let us determine whether a causal
relationship exists between test suite granularity and measures of costs or benefits, because it does not
control for other factors that might influence those measures. To see this, suppose that T can be partitioned
into two subsets, T1 and T2, where T1 contains test cases of size less than s and T2 contains test cases of
size greater than or equal to s. Suppose that we compare the costs or benefits of utilizing T1 and T2 and
find that they differ. In this case, we cannot determine whether this difference was caused by test suite
granularity or by differences in the number or type of inputs applied in T1 and T2. For example, the types
of functionality exercised by the inputs in T2 might happen to include all functionality modified to create
P ′, causing differences in performance between the two subsets to occur for reasons other than test case
granularity.
The second approach that we considered is to construct test suites of varying granularities by sampling
a single pool or “universe” of test grains. A test grain is a smallest input that could be used as a test case
(applied from a start state and producing a checkable output) for a target program. A sampling procedure can
select test grains to create test cases of different sizes: a test case of size s consists of s test grains. Applying
this sampling procedure randomly and repeatedly to a universe of n test grains, without replacement, until
no test grains remain (partitioning the universe into n/s test cases of size s, and possibly one smaller test
case), yields a test suite of granularity level s. Repeating this procedure for each of several values of s
provides test suites of different granularity levels that can be compared controlling for differences in types
and numbers of inputs.
We chose this second approach, and employed seven granularity levels: 1, 2, 4, 8, 16, 32 and 64, which
we refer to as G1, G2, G4, G8, G16, G32 and G64, respectively. To facilitate discussion, when referring to
granularity levels, we refer to test suites employing lower granularity level numbers as fine granularity test
suites, and test suites employing higher granularity level numbers as coarse granularity test suites.
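The random sampling procedure can be sketched as follows (illustrative; the function name and seed handling are ours). For example, partitioning the 1985 emp-server test grains at granularity level G64 yields 31 test cases of size 64 plus one of size 1, i.e., 32 test cases in all, matching Table 2:

```python
import random

def make_granularity_suite(grains, s, seed=0):
    """Partition a universe of test grains into test cases of granularity s.

    Samples grains randomly without replacement, yielding n/s test cases
    of size s and possibly one smaller test case from the remainder.
    """
    rng = random.Random(seed)        # fixed seed for reproducibility
    pool = list(grains)
    rng.shuffle(pool)                # random sampling without replacement
    return [pool[i:i + s] for i in range(0, len(pool), s)]

# 1985 emp-server grains at G64: 31 test cases of 64 grains plus one of 1
suite = make_granularity_suite(range(1985), 64)
assert len(suite) == 32
```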
Test Input Grouping
In our procedure for constructing test suites of different granularities, applying the sampling procedure
repeatedly to a universe of n test grains and sampling randomly across the whole universe each time (without
replacement) creates random grouping test cases. Such a grouping strategy, however, may not reflect the
way in which inputs are grouped into test cases in practice, and thus we also considered a second strategy
for grouping test inputs, which creates functional grouping test cases.
Functional grouping test cases are composed (to the extent possible) of inputs that exercise the same
functionality. To create functional grouping test cases, we first separated the test grains in the test universe U
for each program P into “buckets”, where each bucket Bk contains the test grains in U targeting functionality
k in P . Given these buckets, we considered two approaches for creating functional grouping test cases of
granularity level s:
• From within each bucket, randomly select groups of s test grains without replacement until fewer than
s test grains remain in the bucket. Do this for each bucket. Collect any test grains remaining in any
buckets into a single pool, and from this pool randomly select groups of s test grains until all have been
selected.
• From within each bucket, randomly select groups of s test grains without replacement until fewer than
s test grains remain in the bucket. If any test grains remain in that bucket, let them constitute one
final group (of size less than s). Do this for each bucket.
The difference between these two approaches lies in their handling of test grains that remain in buckets
after the maximum possible number of groups of size s has been selected from those buckets. The first
approach has the drawback that, depending on the number and sizes of buckets, it may create a certain
number of functionally non-homogeneous test cases. The second approach has the drawback that it might
yield a large number of test cases of size less than s at each granularity level (potentially as many as one per
bucket). The presence of test cases of different sizes would make it impossible to draw conclusions about the
effects of granularity: we need to control for the number and size of test cases created at each granularity
level.
Thus, we selected the first approach as our grouping strategy. This strategy provides us with a set of test
cases, at each granularity level, equivalent in size to the set of test cases obtained with the random grouping
strategy, and lets us draw conclusions about the potential influence of functional grouping on granularity
effects. In interpreting our results we take care to consider functional non-homogeneity among our test cases.
3.2.2 Dependent Variables and Measures
To investigate our hypotheses we need to measure the costs and benefits of the various regression-testing-
related techniques considered. To do this we constructed three models. Our first two models assess the costs
and benefits of retest-all, regression test selection, and test suite reduction, and our third model assesses the
benefits of test case prioritization.
Savings in Test Execution Time
Regression test selection and test suite reduction techniques achieve savings by reducing the number of test
cases that need to be executed on P ′, thereby reducing the effort required to retest P ′. The use of different
test suite granularities and test input groupings may also affect the savings in test execution and validation
time that can be achieved by selection, reduction, and retest-all. To evaluate these effects, we measure the
time required to execute and validate the outputs of the test cases in test suites, selected test suites, and
reduced test suites, across different granularities and groupings.
Costs in Fault-Detection Effectiveness
One potential cost of regression test selection and test suite reduction is the cost of missing faults that
would have been exposed by test suites prior to selection or reduction. Missed faults could also occur due to
differences in test suite granularity or test input grouping, for these techniques and the retest-all technique.
Costs in fault-detection effectiveness can be measured by studying programs containing known faults.
When dealing with single faults, one common fault-detection effectiveness measure [15, 21] estimates, for
each test case t, whether t detects fault f in P ′, by applying t to two versions of P ′, one that contains f
and one that does not. If the outputs of P and P ′ (program outputs and contents of relevant external files)
differ on t, t is assumed to reveal f . Given this approach, the fault-detection effectiveness for a specific test
suite T can be measured by considering fault-detection effectiveness results for each test case t ∈ T .
In our experiments, however, we wish to study programs containing multiple faults. When P ′ contains
multiple faults, it is not sufficient to note which test cases cause P and P ′ to produce different outputs; we
must also determine which test cases could contribute to revealing which faults. One way to do this [24] is
to instrument P ′ such that when t is run on P ′ we can determine, for each fault f in P ′, whether: (1) t
reaches f , (2) t causes a change in data state following execution of f , and (3) the output of P ′ on t differs
from the output of P on t.
One drawback of this approach is that it can underestimate the faults that could be found in practice
with t. To see this, suppose that P ′ contains faults f1 and f2, which can each be detected by t if present
alone. Suppose, however, that when f1 and f2 are both present in P ′, f1 prevents t from reaching f2. This
approach would suggest that t cannot detect f2. In a debugging process, however, an engineer might detect
and correct f1, and then on re-running t on the (partially) corrected P ′, detect f2. A second drawback of
this approach is that testing for data state changes can be extremely difficult in programs that manipulate
enormous data spaces, such as those we use in these experiments.
For these reasons, we chose a different approach. We activated each fault f in P ′ individually, executed
each test case t (at each granularity level) on P ′, and determined whether t detects f singly by noting
whether it causes P and P ′ to produce different outputs. We then assumed that detection of f when present
singly implies detection of f when present in combination with other faults.
This approach avoids the drawbacks of the first: it captures the results of an incremental fault-correction
process without requiring detection of data state changes. The approach may overestimate fault detection,
however, in cases where multiple faults would actually mask each other’s effects, causing no failures to occur
on t. We investigated the possible magnitude of this error in our study by also executing, at each granularity
level, all test cases on versions with all seeded faults activated, and measuring the extent to which test cases
that caused single-fault versions to fail did not cause multi-fault versions to fail.4 The data showed that for
emp-server, across all versions and granularities, masking occurred on only 339 of 70,992 test cases (0.48%),
and for bash, across all versions and granularities, it occurred on only 5 of 41,742 test cases (0.012%). We
thus considered masking a nuisance variable, posing only a minor threat to the validity of our experiments.

4 This check does not eliminate the possibility that some subset of the faults in a multi-fault version might mask one
another, and be undetected by test case t in that version even though detected singly by t; however, it is not computationally
feasible to check for this possibility.
[Figure 1 content: Panel A shows a test suite of five test cases, A through E, and the ten faults each exposes.
Panels B, C, and D plot percentage of detected faults against the fraction of the test suite used, for three
test case orders: A-B-C-D-E (APFD = 50%), E-D-C-B-A (APFD = 64%), and C-E-B-A-D (APFD = 84%).]
Figure 1: Examples illustrating the APFD metric.
Savings in Rate of Fault Detection
The test case prioritization techniques we consider have a goal of increasing a test suite’s rate of fault
detection. We wish to determine whether test suite granularity and test input grouping affect the ability of
prioritization technique’s to achieve this goal. To measure rate of fault detection, we use a metric APFD,
introduced for this purpose in [39], that measures the weighted average of the percentage of faults detected
over the life of a test suite. APFD values range from 0 to 100; higher numbers imply faster (better) fault
detection rates. More formally, let T be a test suite containing n test cases, and let F be a set of m faults
revealed by T . Let TFi be the index of the first test case in ordering T ′ of T that reveals fault i. The APFD
for test suite T ′ is given by the equation:
APFD = 1 − (TF1 + TF2 + ... + TFm) / (nm) + 1/(2n)
To obtain an intuition for this metric, consider an example program with 10 faults and a test suite of
5 test cases, A through E, with fault detecting abilities as shown in Figure 1.A. Suppose we place the test
cases in order A–B–C–D–E to form prioritized test suite T1. Figure 1.B shows the percentage of detected
faults versus the fraction of T1 used. After running test case A, 2 of the 10 faults are detected; thus 20%
of the faults have been detected after 0.2 of T1 has been used. After running test case B, 2 more faults are
detected, and thus 40% of the faults have been detected after 0.4 of T1 has been used. The area under
the curve represents the weighted average of the percentage of faults detected over the life of the test suite.
This area is the prioritized test suite’s average percentage faults detected metric (APFD); the APFD is 50%
in this example.
Figure 1.C reflects what happens when the order of test cases is changed to E–D–C–B–A, yielding a
“faster detecting” suite than T1 with APFD 64%. Figure 1.D shows the effects of using a prioritized test
suite T3 whose test case order is C–E–B–A–D. By inspection, it is clear that this order results in the
earliest detection of the most faults and illustrates an optimal order, with APFD 84%.
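The three APFD values in this example can be checked directly from the formula. The fault matrix below is our reconstruction of Figure 1.A (it reproduces all three reported APFD values):

```python
def apfd(order, faults):
    """Compute APFD for a test case ordering.

    order: list of test case ids, in execution order.
    faults: dict mapping fault id -> set of test cases that reveal it.
    """
    n, m = len(order), len(faults)
    position = {t: i + 1 for i, t in enumerate(order)}  # 1-based TF indices
    tf_sum = sum(min(position[t] for t in revealing if t in position)
                 for revealing in faults.values())
    return 1 - tf_sum / (n * m) + 1 / (2 * n)

# Fault matrix reconstructed from Figure 1.A (faults 1-10, test cases A-E)
faults = {1: {'A', 'C'}, 2: {'A', 'C'}, 3: {'B', 'C'}, 4: {'B', 'C'},
          5: {'C', 'D'}, 6: {'C'}, 7: {'C'}, 8: {'E'}, 9: {'E'}, 10: {'E'}}

assert round(apfd(list('ABCDE'), faults), 2) == 0.50   # Figure 1.B
assert round(apfd(list('EDCBA'), faults), 2) == 0.64   # Figure 1.C
assert round(apfd(list('CEBAD'), faults), 2) == 0.84   # Figure 1.D
```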
Granularity Level   emp-server   bash
G1                  1985         1168
G2                  993          584
G4                  497          292
G8                  249          146
G16                 125          73
G32                 63           37
G64                 32           19
Table 2: Test Cases per Granularity Level
3.3 Experiment Setup
3.3.1 Test Cases and Test Automation
To examine our research question we required test cases for emp-server and bash. These test cases needed
to be realistic, but also needed to facilitate the controlled investigation of the effects of test suite granularity
and test input grouping following the methodology outlined in Section 3.2.1. The approaches we used to
create and automate these test cases, which differed between our programs, were as follows.
Emp-server Test Cases and Test Automation
No test cases were available for emp-server. To construct test cases we used the Empire information files,
which describe the 196 commands recognized by emp-server and the parameters and environmental effects
associated with each. We treated these files as informal specifications for system functions and used them,
together with the category partition method [31], to construct a suite of test cases for emp-server that
exercise each parameter, environmental effect, and erroneous condition described in the files.
We deliberately created the smallest test cases possible, each using the minimum number of commands
necessary to cover its target requirement. Each test case consists of a sequence of between one and six lines
of characters (average 1.2 lines per test case), and constitutes a sequence of inputs to the client, which the
client passes to emp-server. Because the complexity of commands, parameters, and effects varies widely
across the various Empire commands, this process yielded between one and 38 test cases for each command,
and ultimately produced 1985 test cases. These test cases constituted our test grains, as well as our test
cases at granularity level G1. We then used the two sampling procedures described in Section 3.2.1 to create
random and functional grouping test suites at granularity levels G2, G4, G8, G16, G32, and G64, the sizes
of which are shown in Table 2.
The test cases for emp-server fell naturally into buckets distinguished by command, yielding 196 buckets
with an average size of 12 test cases apiece. No buckets had size greater than 64, and few had sizes greater than
16. Thus, test cases created by our sampling procedure for emp-server become less functionally homogeneous
as granularity level increases (see the discussion of this issue in Section 3.2.1). Table 3 illustrates, for each
granularity level, the percentage of purely functionally homogeneous test cases present in the functional
grouping test suites at that level. When analyzing our results we take care to consider this data.
To execute and validate test cases automatically, we created test scripts. Given test suite T , for each
test case t in T these scripts: (1) initialize the Empire database to a start state; (2) invoke emp-server;
Program      G2     G4     G8     G16    G32    G64
emp-server   95.0   89.0   72.0   35.0   12.0   0.0
bash         95.0   98.0   95.5   90.4   78.4   63.2
Table 3: Percentages of Purely Homogeneous Test Cases Present in Functional Groupings.
(3) invoke a client and issue the sequence of inputs that constitutes the test case to the client, saving all
output returned to the client; (4) terminate the client; (5) shut down emp-server; (6) save the contents of
the database for use in validation; and (7) compare saved client output and database contents with those
archived for the previous version, using a refined version of the Unix “diff” utility. By design, this process
lets us apply (in step 3) all of the test inputs contained in a test case, at all granularity levels.
Bash Test Cases and Test Automation
Each version of bash that we utilized had been released with a test suite, composed of test cases from previous
versions and new test cases designed to validate added functionality. We could not directly use these suites
for our experiment, because they were composed strictly of large test cases, each exercising whole functional
components. Further, the test suites executed, on average, only 33% of the functions in bash.
We thus created regression test suites for bash as follows. First, we partitioned each large test case that
came with bash release 2.0 into the smallest possible test grains. (We used the test cases from release 2.0
because they all function across all releases, whereas test cases added on subsequent releases do not function
on earlier ones, and a uniform application of test cases across all versions is needed to facilitate comparison.)
Second, to exercise functionality not covered by the original test suite, we created additional small test cases
by using the reference documentation for bash [32] as an informal specification.
The resulting test suite contains 1168 test cases, exercising an average of 64% of the functions across all
the versions. Each test case in the new test suite contains between one and 54 lines. Each line constitutes an
instruction consisting of bash or Expect [28] commands that can be executed on an instance of bash (Expect
scripts were used for test cases exercising features of bash that required interaction). The
1168 test cases constituted our test grains, and test cases at granularity level G1. As with emp-server, we
then followed the procedure described in Section 3.2.1 to create random and functional grouping test suites
at granularity levels G2, G4, G8, G16, G32, and G64, as reported in Table 2.
As with emp-server, our sampling procedure, applied to bash, did create some test cases that were not
homogeneous. For bash, however, the number of buckets identified (18) was far smaller, and average bucket
size (64) much larger, than for emp-server. Thus, functional grouping test cases were more frequently
functionally homogeneous for bash than for emp-server (see Table 3). When analyzing our results we take
care to consider this fact.
3.3.2 Faults
We wished to evaluate the performance of regression-testing-related methodologies with respect to detection
of regression faults – faults created in a program version as a result of the modifications that produced
that version. Emp-server and bash were not equipped, however, with fault logs of detail sufficient to let
us locate actual regression faults (a problem typical in the use of open-source software in experimentation).
Thus, following a procedure described in [21], we seeded faults. We asked several graduate and undergraduate
computer science students, each with at least two years experience programming in C and unacquainted with
the details of this study, to become familiar with the programs and insert regression faults into the versions.
The fault seeders were told to insert faults that were as realistic as possible based on their experience with
real programs, and that involved code deleted from, inserted into, or modified in the versions.
To further direct their efforts, the fault seeders were given the following list of types of faults to consider:
• Faults associated with variables, such as with definitions of variables, redefinitions of variables, deletions
of variables, or changes in values of variables in assignment statements.
• Faults associated with control flow, such as addition of new blocks of code, deletions of paths, redefi-
nitions of execution conditions, removal of blocks, changes in order of execution, new calls to external
functions, removal of calls to external functions, addition of functions, or deletions of functions.
• Faults associated with memory allocation, such as not freeing allocated memory, failing to initialize
memory, or creating erroneous pointers.
Given ten potential faults seeded in each version of each program, we activated these faults individually,
and executed the test suites (at each granularity level) for the programs to determine which faults could
be revealed by which test cases, following the process outlined in Section 3.2.2. We excluded any potential
faults that were not detected by any test cases at any granularity level: such faults are meaningless to our
measures and cannot influence our results. We also excluded any faults that, at every granularity level, were
detected by more than 80% of the test cases; our assumption was that such easily detected faults would be
detected by test engineers during their unit testing of modifications (only five faults fell into this category).
Excluding faults detected by greater than 80% of the test cases in some, as opposed to every, level would
be inappropriate: the exclusion rule must be uniform across levels to avoid biasing results in favor of faults
that are detected differently at different levels. When this process was complete, 159 faults remained across
all versions of both programs.
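The two exclusion rules can be stated as a simple filter (an illustrative sketch; the data layout and function name are hypothetical):

```python
def filter_faults(detection, threshold=0.80):
    """Apply the fault-exclusion rules uniformly across granularity levels.

    detection: dict mapping fault id -> dict mapping granularity level ->
    fraction of test cases at that level detecting the fault.
    Drops faults detected at no granularity level, and faults detected by
    more than `threshold` of the test cases at EVERY level.
    """
    kept = {}
    for fault, by_level in detection.items():
        fractions = list(by_level.values())
        if all(f == 0 for f in fractions):
            continue      # undetectable: cannot influence the measures
        if all(f > threshold for f in fractions):
            continue      # too easily detected at every granularity level
        kept[fault] = by_level
    return kept
```

Note the uniformity of the second rule: a fault detected by more than 80% of the test cases at some, but not all, levels is retained, exactly as the exclusion criterion above requires.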
3.3.3 Additional Instrumentation
To perform our experiments we required additional instrumentation. Our test coverage and control-flow
graph information was provided by the Aristotle program analysis system [18] and by the Clic instrumentor
and monitor [12]. We created test case prioritization, test suite reduction, and regression test selection tools
implementing the techniques described in Section 3.2.1. We used Unix utilities and direct inspection to
determine modified functions, or functions using modified structures.
All timing-related data was gathered on a Sun Ultra 60 with 512 MB of memory. While timing data was
being collected, our testing processes were the only active user processes on the machine.
3.4 Experiment Design and Analysis Strategy
To address our hypotheses we designed four sets of experiments for each program, each with the same format.
These experiments evaluate the hypotheses for retest-all, regression test selection, test suite reduction, and
test case prioritization, respectively. In addition, each experiment has three factors with multiple levels to
ensure unbiased treatment assignment. We employ a Randomized Factorial (RF) design that has 2 levels for
grouping strategy, 7 levels for granularity, and a varying number of techniques depending on the particular
experiment. Each design cell has nine observations, corresponding to each of the versions (after the base
version) from each program under each treatment combination. These versions constitute random effects
that we do not control, and we consider them samples from a population of program versions.
The choice of a factorial design was based on the power of analysis offered by its treatment combinations,
which lets us interpret not only the main factors but also their interactions. The incorporation of three factors
was aimed at decreasing the variability of the results by controlling more independent variables, while at the
same time increasing the generalizability of the results by observing various scenarios that might be present
in the real world. We analyze emp-server and bash separately to reduce the impact of program related
factors that we did not fully control (e.g. software evolution, differences in test suites) on the results.
From the standpoint of empirical methodologies, it is interesting to note that such a factorial design
is often avoided in other disciplines due to the costs of obtaining “subjects” for all possible combinations
of independent variables. Since our “subjects” were programs and we had automated a large part of the
experiment, we were able to gather the data necessary to comply with such a design. Still, given the effort
involved in preparing program versions (ranging, approximately, from 80 to 300 hours per version) we wanted
to detect meaningful effects with a minimal number of invested resources. We decided to conservatively
determine sample size by doubling the number of versions used in the first instantiation of this study [33]
where significance was detected for at least one of the factors we are studying here.
3.5 Threats to Validity
Any controlled experiment is subject to threats to validity, and these must be considered in order to assess
the meaning and impact of results (see [42] for a general discussion of validity evaluation and a threats
classification). In this section we describe the internal, external, construct, and conclusion threats to the
validity of these experiments, and the approaches we used to limit their impact.
3.5.1 Internal Validity
To test our hypotheses we had to conduct experiments requiring a large number of processes and tools.
Some of these processes (e.g., fault seeding) involved programmers and some of the tools were specifically
developed for the experiments, all of which could have added variability to our results increasing threats to
internal validity. We used several procedures to control these sources of variation. For example, the fault
seeding process was performed following a specification so that each programmer operated in a similar way,
and it was performed in two locations using different groups of programmers. Also, we validated new tools
by testing them on small sample programs and test suites, refining them as we targeted the larger programs,
and cross validating them across labs.
Having only one test suite for each test input grouping type at each granularity level per program is also a
potential threat to internal validity. Although the use of multiple test suites would have been preferable, the
expense of creating such suites was prohibitive. Our process for generating coarser granularity test suites,
however, involved randomly selecting and joining test grains, reducing the chances of bias caused by test
suite composition.
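The grain-joining procedure described above can be sketched as follows (a minimal illustration; the function and variable names are ours, not drawn from the experimental infrastructure):

```python
import random

def compose_suite(grains, k, seed=0):
    """Randomly group granularity-1 test grains into coarser test
    cases of k grains each (granularity level Gk)."""
    rng = random.Random(seed)
    pool = list(grains)
    rng.shuffle(pool)  # random selection/joining guards against composition bias
    # join each consecutive batch of k shuffled grains into one coarse test case
    return [pool[i:i + k] for i in range(0, len(pool), k)]

suite = compose_suite(range(128), 32)
print(len(suite))  # 128 grains grouped into 4 coarse test cases at G32
```

Because the grouping is random, any systematic relationship between a grain's position in the original suite and the coarse test case it lands in is broken, which is the bias-reduction argument made above.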
Our handling of masking effects, described in Section 3.2.2, might constitute a further threat to internal
validity; however, as noted there, our analysis suggests that such effects occur infrequently among the test
cases we utilized.
3.5.2 External Validity
Three issues affect the generalization of our results. The first issue is the quantity and quality of programs
studied. Although using only two programs lessens the external validity of the results, the relatively consistent
results we obtain for bash and emp-server suggest that the results may generalize. Further, we are able to
study a relatively large number of actual, sequential releases of these systems. Regarding program quality,
there is a large population of C programs of similar size. For example, the Linux Red Hat 7.1 distribution
includes source code for 394 applications; the average size of these applications is 22,104 non-comment lines
of code, and 19% have sizes between 25 and 75 KLOC, similar to the programs studied in our experiment.
Nevertheless, replication of these studies on other programs could increase the confidence in our results, and
help us investigate other factors.
The second issue involves fault representativeness. Our fault seeding process helped us control for threats
to internal validity that must be controlled in order to examine causal factors; however, faults and fault
patterns may differ in practice, and additional studies of additional fault populations are needed.
The third limiting factor is test process representativeness. Although the random and functional grouping
procedures we employed to obtain coarser granularity test suites are powerful in terms of control, they
constitute simulations of the testing procedures used in industry, and this might also impact the generalization
of our results. Complementing these controlled experiments with case studies on industrial test suites, though
sacrificing internal validity, could be helpful.
3.5.3 Construct Validity
The three dependent measures that we have considered are not the only possible measures of the costs and
benefits of regression testing methodologies. Our measures ignore the human costs that can be involved in
executing, auditing and managing test suites. Our measures do not consider debugging costs such as the
difficulty of fault localization, which could favor fine granularity test suites [20]. Our measures also ignore the
analysis time required to select or prioritize test cases, or reduce test suites. Previous work [34, 38, 39] has
shown, however, that for the techniques considered, either analysis time is much smaller than test execution
time, or analysis can be accomplished automatically in off-hours prior to the critical regression testing period
(thus, having no effect on cost-benefits).
3.5.4 Conclusion Validity
The number of programs and versions we considered was large enough to show significance for most of the
techniques we studied in most, but not all, cases. Although the use of more versions would have increased the
power of the experiment, the average cost of preparing each version ranged from 80 to 300 hours, limiting
the cost-effectiveness of taking additional observations.
3.6 Data and Analysis
In the following sections we investigate the effects of test suite granularity and grouping strategy on our four
regression testing methodologies, in turn, employing descriptive and inferential statistics.
3.6.1 Retest-All
We begin by exploring the impact of test suite granularity and grouping strategy on the retest-all technique.
Figure 2 summarizes the fault detection effectiveness (leftmost pair of graphs) and test execution time
(rightmost pair of graphs) observed per program as granularity level increases, for both grouping strategies.
Each graph contains seven data points per program, with each point representing the average, across all nine
modified versions of the given program, of the metric (fault detection effectiveness or test execution time)
being graphed. We join the data points with lines to assist interpretation.
The leftmost pair of graphs shows that the fault detection effectiveness of the test suites remained nearly
constant for both programs, independent of changes in granularity level or grouping strategy. In total, only
three cases occurred in which faults detected at lower granularity levels were lost at granularity level G32,
and only two cases occurred in which faults detected at lower granularity levels were lost at granularity level
G64 (too few to be visible in the graphs). The test suites for the programs, used in their entirety, were
almost always powerful enough — across all granularities and on all versions — to detect all of the faults in
the programs. We will have more to say about this in Sections 4.1 and 4.5, in our discussion of results.
The rightmost pair of graphs shows that test execution time decreased as granularity level increased,
independent of grouping strategy or program. For example, under the random grouping strategy, from
granularity level G1 to granularity level G64, test execution time decreased, for bash, from 782 minutes to
222 minutes, and for emp-server, from 505 minutes to 26 minutes.
We formally investigated these tendencies relative to our hypotheses by performing an analysis of variance
(ANOVA) for each program. The presentation of the ANOVA results includes the sources of variation
considered, and for each program, the sum of squares, degrees of freedom, mean squares, F value, and
p-value for each source. Because we set alpha to 0.05, and the p-value represents the smallest level of
significance that would lead to the rejection of a null hypothesis, we reject a hypothesis when p is less
than alpha. The results (Table 4) are consistent for both programs, indicating that granularity level, but
not grouping strategy, significantly affected execution time. The data showed no evidence of significant
interactions between the independent variables for either program.
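For readers unfamiliar with the mechanics behind the reported ANOVA tables, the sums of squares, degrees of freedom, mean squares, and F values can be illustrated with a minimal one-way computation (a simplification of the two-way, granularity-by-grouping design actually used; the data values below are hypothetical):

```python
def one_way_anova(groups):
    """Return (SS_between, SS_error, DF_between, DF_error,
    MS_between, MS_error, F) for a one-way layout; the p-value is then
    read from the F distribution with (DF_between, DF_error) degrees
    of freedom and compared against alpha."""
    all_obs = [x for g in groups for x in g]
    grand_mean = sum(all_obs) / len(all_obs)
    # between-groups (treatment) and within-groups (error) sums of squares
    ss_b = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_e = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_b, df_e = len(groups) - 1, len(all_obs) - len(groups)
    ms_b, ms_e = ss_b / df_b, ss_e / df_e
    return ss_b, ss_e, df_b, df_e, ms_b, ms_e, ms_b / ms_e

# hypothetical execution times (minutes) at three granularity levels
f = one_way_anova([[780, 785], [520, 518], [210, 214]])[-1]
print(round(f, 1))  # a very large F: group means differ far more than the error term
```

In the two-way design used here, analogous sums of squares are computed for each factor and each interaction, which is what the tables below report.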
Figure 2: Fault detection effectiveness for random and functional grouping strategies (leftmost pair of columns) and test execution time for random and functional grouping strategies (rightmost pair of columns) for the retest-all technique, across granularity levels (x-axis), averaged across versions.
Technique: retest-all
Variable: test execution time

                            Emp-server                         Bash
Source                SS       DF   MS      F        p     SS       DF   MS      F      p
Granularity           3268098    6  544683  5693.41  0.00  4635964    6  772661  19.13  0.00
Grouping                  199    1     199     2.08  0.15   131338    1  131338   3.25  0.07
Granularity*Grouping      367    6      61     0.64  0.70    70586    6   11764   0.29  0.94
Error                   10715  112      96                 4522624  112   40381

Table 4: Retest-all ANOVA.
3.6.2 Regression Test Selection
To facilitate the comparison of regression test selection techniques to each other and to the retest-all tech-
nique, we depict the data on these techniques together in Figure 3. The graphs in the first row present
results for the retest-all technique, and the other rows present results for the three RTS techniques.
Figure 3: Fault detection effectiveness for random and functional grouping strategies (leftmost pair of columns) and test execution time for random and functional grouping strategies (rightmost pair of columns) for retest-all and RTS techniques, across granularity levels (x-axis), averaged across versions.
As the graphs indicate, the modified entity technique exhibited the same trends as the retest-all technique,
retaining fault detection effectiveness across granularity levels, and exhibiting a large reduction in the amount
of time required to re-execute the test suite as granularity level increased. The reason for this behavior is
that the location of changes in these particular program versions caused this safe RTS technique to require
execution of all existing test cases, because all test cases traversed code changed for the new version.
The modified non-core entity technique displayed different behavior. With this technique, for both
grouping strategies and at several granularity levels, faults were left undetected. For the random grouping
strategy, fault-detection effectiveness increased, from granularity level G1 to level G64, by approximately
14% for emp-server and 10% for bash. For the functional grouping strategy this same tendency occurred
for emp-server, but not for bash, for which fault-detection effectiveness varied widely across granularity
levels.
Fault-detection effectiveness results ran contrary to our intuitions; we had expected fault-detection ef-
fectiveness for bash to increase as granularity level increased, for the modified non-core entity technique,
because the technique excludes fewer test cases at higher granularity levels than at lower ones. Further
analysis of the data suggests that this difference between bash and emp-server arose due to differences in
the difficulties of exposing the faults in the programs. All but one of bash’s faults were exposed by fewer than
1% of that program’s granularity level 1 test cases, whereas only 23% of emp-server’s faults were exposed
by fewer than 1% of that program’s granularity level 1 test cases. We return to this issue in Section 4.
Test execution time with the modified non-core entity technique decreased as granularity level increased,
though by a smaller amount than occurred for the retest-all and modified-entity techniques. This difference
is due to the fact that the modified non-core entity technique selects fewer test cases at lower granularity
levels than at higher ones (in general, a given fine-granularity test case is less likely to encounter changes than
a given coarse-granularity test case.) For example, for emp-server under the random grouping strategy, the
modified non-core entity technique selected on average 35% of the test cases at granularity level G1, 68% at
level G4, and 96% at level G64.
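The selection behavior just described can be illustrated with a small coverage-based sketch (test and function names are hypothetical, and the actual modified non-core entity technique additionally excludes changes to "core" entities covered by nearly all tests):

```python
def select(test_coverage, modified_entities):
    """Coverage-based regression test selection (sketch): keep every
    test case whose coverage intersects the set of modified entities."""
    changed = set(modified_entities)
    return [t for t, covered in test_coverage.items() if covered & changed]

# four fine-granularity test cases, each covering one function...
fine = {"t1": {"f1"}, "t2": {"f2"}, "t3": {"f3"}, "t4": {"f4"}}
# ...joined pairwise into coarse test cases covering the unions
coarse = {"t12": {"f1", "f2"}, "t34": {"f3", "f4"}}
print(select(fine, {"f2"}))    # 1 of 4 fine test cases selected (25%)
print(select(coarse, {"f2"}))  # 1 of 2 coarse test cases selected (50%)
```

Because a coarse test case covers the union of its grains' coverage, it is more likely to intersect a change, so a larger fraction of the suite is selected at higher granularity levels, as observed above.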
We also observe that at granularity levels G2 through G8 on emp-server, and G4 through G32 on bash,
the functional grouping strategy appears to be associated with somewhat lower test execution times than the
random grouping strategy, for the modified non-core entity technique. These granularity levels are all levels
at which over 70% of the functional grouping test cases are homogeneous. The difference in performance
across grouping strategies can be attributed to the fact that homogeneous functional grouping test cases
are more likely than randomly grouped test cases to have similar code coverage characteristics. When code
modifications are limited, the number of test cases encountering those modifications (and thus the number
of test cases selected by a modified non-core entity RTS technique) will be less when the individual test
grains encountering modifications have been collected together into a few test cases, rather than distributed
randomly across many test cases.
Finally, the minimization RTS technique (fourth row of Figure 3) exhibited different behavior. First, we
observe greater variation in the percentage of faults detected with this technique than with the other RTS
techniques. Fault detection effectiveness also seems to have less consistently increasing and more variable
tendencies for bash than for emp-server, arguably due to the larger number of relatively difficult-to-detect
faults in bash.

Figure 4: Minimization RTS technique execution times averaged across all versions of each program.
Considering test execution time for the minimization RTS technique, on both programs, differences in
execution time across granularity levels (and consequently, differences in the savings achievable through
minimization across granularity levels) were not large for emp-server, independent of grouping strategy.
However, trends in execution time differ between the two programs, as can be more clearly seen in Figure 4,
which presents the same data with the y-axis scale modified. For the random grouping strategy, emp-server
exhibits little difference in test execution time across granularity levels. On bash, however, test execution
time increases (for the random grouping strategy) from 18 minutes at granularity level G1, to 25 minutes
at level G8, and to 81 minutes at level G64. This difference is likely attributable to differences in coverage
achieved by the test cases for the two programs. Bash’s granularity level G1 test cases, on average, exercise
larger and more varied sets of functions than emp-server test cases, and as these are combined into coarser
granularity test cases, the opportunities for culling out redundancies among those test cases decrease more
rapidly. To summarize, it seems that increases in granularity level can have a negative effect on the ability of
the minimization RTS technique to provide savings, but these results also depend on the program or, more
directly, the coverage patterns achieved on that program by its test cases.
To formally determine whether the impact of test suite granularity and grouping strategy on our depen-
dent variables was statistically significant — corresponding to our first two hypotheses — we performed an
Techniques: modified non-core entity and retest-all
Variable: fault-detection effectiveness

                             Emp-server                  Bash
Source                 SS   DF   MS  F      p      SS    DF   MS  F     p
Granularity             9    6    1   3.97  0.00     28    6    5  0.53  0.78
Grouping                0    1    0   1.08  0.30     12    1   12  1.37  0.24
Technique               8    1    8  20.87  0.00     70    1   70  8.00  0.01
Granularity*Grouping    3    6    0   1.23  0.29     35    6    6  0.67  0.67
Granularity*Technique  11    6    2   4.97  0.00      5    6    1  0.09  1.00
Grouping*Technique      1    1    1   1.55  0.21      1    1    1  0.08  0.78
Gran.*Group.*Tech.      1    6    0   0.40  0.88      3    6    0  0.05  1.00
Error                  82  224    0                1967  224    9

Variable: test execution time

                             Emp-server                          Bash
Source                 SS      DF   MS      F       p     SS       DF   MS       F      p
Granularity            2851876   6  475313  236.02  0.00  2959544    6   493257  12.55  0.00
Grouping                  3944   1    3944    1.96  0.16   154812    1   154812   3.94  0.05
Technique               402157   1  402157  199.70  0.00  1629161    1  1629161  41.44  0.00
Granularity*Grouping      4837   6     806    0.40  0.88   119699    6    19950   0.51  0.80
Granularity*Technique   758867   6  126478   62.80  0.00  1779040    6   296507   7.54  0.00
Grouping*Technique        1835   1    1835    0.91  0.34    14175    1    14175   0.36  0.55
Gran.*Group.*Tech.        2121   6     353    0.18  0.98     3533    6      589   0.02  1.00
Error                   451100 224    2014                8807271  224    39318

Table 5: Retest-all and modified non-core entity ANOVAs.
analysis of variance. The analysis considers all the factors utilized in the ANOVA for the retest-all technique,
and also incorporates technique as a new source of variation. Because the retest-all technique constitutes
the control technique for regression test selection, we paired it with each of the other RTS techniques to
determine whether those techniques’ effects on the dependent variable were significantly different than the
retest-all technique’s effect, and to determine whether the technique variable was more susceptible than oth-
ers to interactions with other sources of variation. (Because the data for the retest-all and modified-entity
techniques were nearly identical, we omit the comparison between these techniques.)
Table 5 presents the results of this analysis on each program applied to the retest-all and modified
non-core entity techniques. Considering fault detection effectiveness, the results on emp-server indicate
that granularity level and technique did have statistically significant impacts, matching our observations on
Figure 3. On bash, only technique exhibited significance; this was evident in Figure 3 where the retest-
all technique did not exhibit any variation for this dependent variable. Where the lack of significance for
granularity level on bash is concerned, the amount of variance in fault detection effectiveness across versions
could have limited our ability to detect a significant effect with the current number of observations in spite
of the tendencies observed in the graph.
We also found that for emp-server, though not for bash, the interaction between granularity level and
technique was significant, indicating that the impact of granularity level on fault detection effectiveness
differed depending on the technique utilized.
To better understand this interaction and to identify significant differences between means we performed
a Bonferroni multiple comparison analysis. This approach provides a post-hoc comparison of the effects’
means while controlling for the family-wise type of error. Table 6 presents the results of this analysis for all
combinations of granularity level and technique interaction, sorted by the mean fault-detection effectiveness
Emp-server
Source: Granularity * Technique
Dependent Variable: Fault Detection Effectiveness

Granularity  Technique                 Mean   Homogeneous Groups
1            modified non-core entity   8.67  A
8            modified non-core entity   9.67  B
2            modified non-core entity   9.72  B
4            modified non-core entity   9.72  B
16           modified non-core entity   9.83  B
64           modified non-core entity   9.83  B
32           modified non-core entity   9.83  B
64           retest-all                 9.83  B
32           retest-all                 9.89  B
1            retest-all                10.00  B
2            retest-all                10.00  B
4            retest-all                10.00  B
8            retest-all                10.00  B
16           retest-all                10.00  B

Table 6: Bonferroni results: Emp-server, granularity * technique, fault detection effectiveness, modified non-core entity and retest-all.
of the combinations from smallest to largest. Combinations sharing a given letter in the “Homogeneous
Groups” column belong to the same statistically homogeneous group; combinations not sharing letters are
significantly different. Overall, the analysis shows that for the retest-all technique, changes in granularity
level did not impact fault detection, whereas for the modified non-core entity technique, at the lowest
granularity level, fault detection was significantly smaller than at higher levels.6
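The family-wise control underlying the Bonferroni procedure can be sketched as follows (the raw p-values are hypothetical, and the actual analysis operates on group means rather than pre-computed p-values):

```python
def bonferroni(pairwise_p, alpha=0.05):
    """Flag pairwise differences as significant while controlling the
    family-wise error rate: each raw p-value must fall below alpha
    divided by the number of comparisons in the family."""
    cutoff = alpha / len(pairwise_p)
    return {pair: p < cutoff for pair, p in pairwise_p.items()}

# hypothetical raw p-values for three pairwise mean comparisons
raw = {("G1", "G8"): 0.001, ("G1", "G64"): 0.030, ("G8", "G64"): 0.200}
print(bonferroni(raw))  # only the first comparison survives the 0.05/3 cutoff
```

Combinations whose pairwise differences are all non-significant end up in the same homogeneous group, which is how the letter columns in the Bonferroni tables are formed.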
Returning to the ANOVAs (Table 5) to consider test execution time, on both programs the same three
factors — granularity level, technique, and their interaction — had statistically significant impact. This
validates our observations that different techniques appeared to be affected in different ways as granularity
level increased, with low granularity levels exposing greater differences between techniques. At higher gran-
ularity levels, reduced execution time savings for RTS techniques, and lower-cost coarser test cases, allowed
the retest-all technique to perform comparably to the RTS techniques. This analysis is confirmed by Bonfer-
roni analyses (Table 7). Results for bash show that at higher granularity levels both techniques performed
similarly, whereas at lower levels (G1 and G2) the retest-all technique was inferior. Further, the ANOVAs
did not reveal significance in the effects of grouping strategy, and thus did not support our observation about
the possible superiority of functional grouping over random in supporting lower test execution times.
Finally, Table 8 presents ANOVA results from the comparison of the retest-all and minimization RTS
techniques. The results on test execution time show significance for the same factors as in the analysis
of the modified non-core entity technique, for both programs. However, the results on fault detection
effectiveness show significance, for both programs, only for technique. As observed earlier, the minimization
RTS technique displayed a large amount of variation in fault detection effectiveness as granularity level
and grouping strategy changed. This variance could be attributable to the known influence of location and
magnitude of changes on the effectiveness of minimization techniques [8].

6 We include in the text only the Bonferroni results that contribute some new insight into the data. The remainder of the tables for interaction analysis using Bonferroni, for all cases in which the ANOVAs showed that interaction effects were significant, are given in Appendix A.
Bash
Source: Granularity * Technique
Dependent Variable: Test Execution Time

Granularity  Technique                 Mean    Homogeneous Groups
64           modified non-core entity  181.89  A
16           modified non-core entity  189.72  A
32           modified non-core entity  196.44  A
64           retest-all                212.78  A
8            modified non-core entity  216.06  A
32           retest-all                230.67  A
4            modified non-core entity  231.11  A
16           retest-all                240.83  A
1            modified non-core entity  252.00  A
2            modified non-core entity  260.44  A
8            retest-all                300.33  A B
4            retest-all                366.44  A B
2            retest-all                520.06  B
1            retest-all                782.22  C

Table 7: Bonferroni results: Bash, granularity * technique, test execution time, modified non-core entity and retest-all.
Techniques: minimization and retest-all
Variable: fault-detection effectiveness

                              Emp-server                    Bash
Source                 SS   DF   MS   F       p      SS    DF   MS   F      p
Granularity            13    6    2    1.21   0.30     50    6    8   1.00  0.43
Grouping                1    1    1    0.65   0.42     18    1   18   2.22  0.14
Technique             242    1  242  137.25   0.00    459    1  459  55.38  0.00
Granularity*Grouping    6    6    1    0.59   0.74     30    6    5   0.60  0.73
Granularity*Technique  18    6    3    1.72   0.12     18    6    3   0.37  0.90
Grouping*Technique      1    1    1    0.81   0.37      0    1    0   0.00  1.00
Gran.*Group.*Tech.      4    6    1    0.37   0.90      9    6    1   0.18  0.98
Error                 395  224    2                  1855  224    8

Variable: test execution time

                              Emp-server                             Bash
Source                 SS       DF   MS       F         p     SS       DF   MS       F       p
Granularity            1655112    6   275852   4221.88  0.00  1954389    6   325731   15.38  0.00
Grouping                    50    1       50      0.76  0.38    58119    1    58119    2.74  0.10
Technique              1385883    1  1385883  21210.76  0.00  7210560    1  7210560  340.36  0.00
Granularity*Grouping       193    6       32      0.49  0.81    44041    6     7340    0.35  0.91
Granularity*Technique  1613161    6   268860   4114.87  0.00  2737958    6   456326   21.54  0.00
Grouping*Technique         167    1      167      2.55  0.11    73680    1    73680    3.48  0.06
Gran.*Group.*Tech.         197    6       33      0.50  0.81    32300    6     5383    0.25  0.96
Error                    14636  224       65                  4745439  224    21185

Table 8: Retest-all and minimization ANOVAs.
3.6.3 Test Suite Reduction
To facilitate the comparison between the GHS reduction and retest-all techniques, we depict the data for
these techniques together in Figure 5. The graphs in the first row present results for the retest-all technique,
and the graphs in the second row present results for the GHS reduction technique.
Figure 5: Fault detection effectiveness for random and functional grouping strategies (columns one and two) and test execution time for random and functional grouping strategies (columns three and four) for the retest-all and test suite reduction techniques across test suite granularities (x-axis), averaged across versions.

As the graphs show, fault detection effectiveness results for GHS reduction were similar to the results
observed for the modified non-core entity and minimization RTS techniques, in that reduction left faults
undetected for both grouping strategies and at most granularity levels. Again, bash fared worse than
emp-server, though not to the same extent as with the RTS techniques. Once again, the overall trend is for
effectiveness to increase as granularity level increases, although this is more evident for the functional grouping
strategy (where we discount the decrease in effectiveness for emp-server at granularity levels above G8 due
to the non-homogeneity of its test cases at those levels.)
Test execution results for reduction were also similar to results for regression test selection. Test execution
time for emp-server under GHS reduction consistently decreased as granularity level increased, but at
a lower rate than for the control (retest-all) technique. This tendency did not hold, however, for bash,
where execution time increased as granularity level increased. As with the most aggressive RTS techniques,
reduction opportunities can be limited by coarser test cases, but this result varies with program and test
suite characteristics.
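The reduction idea can be sketched with a simple greedy heuristic (a simplified stand-in for the GHS heuristic, which additionally exploits requirements covered by only a single test case; all names here are hypothetical):

```python
def reduce_suite(test_coverage, requirements):
    """Greedy test suite reduction (sketch): repeatedly keep the test
    case that covers the most still-unsatisfied requirements, until all
    requirements are covered (or no remaining test helps)."""
    uncovered = set(requirements)
    kept = []
    while uncovered:
        best = max(test_coverage, key=lambda t: len(test_coverage[t] & uncovered))
        if not test_coverage[best] & uncovered:
            break  # remaining requirements cannot be satisfied by this suite
        kept.append(best)
        uncovered -= test_coverage[best]
    return kept

suite = {"t1": {"r1", "r2"}, "t2": {"r2", "r3"}, "t3": {"r3"}}
print(reduce_suite(suite, {"r1", "r2", "r3"}))  # t3 is redundant and dropped
```

When coarse test cases already cover large, overlapping sets of requirements, fewer test cases are redundant, which is why reduction opportunities shrink at higher granularity levels.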
We performed an ANOVA to further evaluate these perceived differences and test our hypotheses. Table
9 presents the results for each of our programs and dependent measures. The results for emp-server
Techniques: reduction and retest-all
Variable: fault-detection effectiveness

                                Emp-server                       Bash
Source                 SS     DF   MS  F      p        SS      DF   MS  F     p
Granularity            12.33    6   2   5.88  0.00     20.52     6   3  0.48  0.82
Grouping                3.11    1   3   8.91  0.00      0.89     1   1  0.13  0.72
Technique              24.14    1  24  69.14  0.00     28.67     1  29  4.04  0.05
Granularity*Grouping    8.72    6   1   4.16  0.00     14.97     6   2  0.35  0.91
Granularity*Technique  13.80    6   2   6.59  0.00      2.19     6   0  0.05  1.00
Grouping*Technique      2.68    1   3   7.68  0.01     11.15     1  11  1.57  0.21
Gran.*Group.*Tech.      8.26    6   1   3.94  0.00      3.71     6   1  0.09  1.00
Error                  78.22  224   0                1588.22   224   7

Variable: test execution time

                              Emp-server                             Bash
Source                 SS       DF   MS       F         p     SS       DF   MS       F       p
Granularity            1862941    6   310490   6287.76  0.00  1471904    6   245317    9.77  0.00
Grouping                    13    1       13      0.27  0.60    79680    1    79680    3.17  0.08
Technique              1046857    1  1046857  21199.99  0.00  3511917    1  3511917  139.92  0.00
Granularity*Grouping       171    6       28      0.58  0.75    78088    6    13015    0.52  0.79
Granularity*Technique  1420831    6   236805   4795.56  0.00  3428299    6   571383   22.76  0.00
Grouping*Technique         559    1      559     11.32  0.00    53012    1    53012    2.11  0.15
Gran.*Group.*Tech.         901    6      150      3.04  0.01    34931    6     5822    0.23  0.97
Error                    11061  224       49                  5622322  224    25100

Table 9: Test suite reduction ANOVA.
Emp-server
Source: Grouping * Technique
Dependent Variable: Fault Detection Effectiveness

Grouping    Technique      Mean  Homogeneous Groups
Random      GHS reduction  9.13  A
Functional  GHS reduction  9.56  B
Random      retest-all     9.95  C
Functional  retest-all     9.97  C

Table 10: Bonferroni results: Emp-server, grouping * technique, fault detection effectiveness, GHS reduction and retest-all.
on fault detection effectiveness of GHS reduction were somewhat surprising: all factors and interactions
were statistically significant. Based on our observations, we had expected granularity level, technique, and
their interaction to be significant. But here we also found an instance in which grouping strategy did
affect fault detection effectiveness. Analysis of interaction effects (see Table 10), however, indicates that the
impact of grouping occurred only for the GHS reduction technique. Fault-detection effectiveness results for
bash, in contrast, indicate that for this program, only technique had a significant impact on fault detection
effectiveness. This difference between programs is likely due to the greater variability, for bash, in fault
detection capabilities of its reduced test suites: the overall standard deviation for the percentage of faults
detected for reduced test suites under bash was 22.6 whereas for emp-server it was 9.1.
On both programs, the effects of granularity level, technique and their interaction on test execution time
were statistically significant. This is similar to our findings for the modified non-core entity and minimization
RTS techniques. For emp-server, however, the two-way interaction between grouping strategy and technique,
and the three-way interaction among grouping strategy, technique, and granularity level, were also significant with respect to test execution
time. This indicates that, although grouping strategy might not be a significant factor on its own, it can
Emp-server
Source: Grouping * Technique
Dependent Variable: Test Execution Time

Grouping    Technique      Mean     Homogeneous Groups
Random      GHS reduction   25.52   A
Functional  GHS reduction   28.96   B
Functional  retest-all     154.89   C
Random      retest-all     157.41   C
Table 11: Bonferroni results: Emp-server, grouping * technique, test execution time, GHS reduction and retest-all.
significantly impact the effect of the other factors. For example, Table 11 shows that for GHS reduction,
the mean execution time for functional grouping test suites was significantly larger than that for random
grouping test suites.
3.6.4 Test Case Prioritization
Our fourth experiment considered test case prioritization. Within this methodology we analyze three tech-
niques: optimal prioritization to provide an upper bound on performance, additional function coverage
prioritization, and additional function coverage prioritization incorporating change information. For brevity,
we use the shorter names “optimal”, “coverage”, and “diff-coverage” for these techniques, respectively.
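The coverage technique is essentially a greedy loop that repeatedly selects the test case covering the most functions not yet covered by earlier selections. The following is a minimal sketch (illustrative Python; the identifiers and data structures are not those of our experimental infrastructure):

```python
def additional_coverage_order(coverage):
    """Greedy 'additional function coverage' prioritization sketch.

    coverage: dict mapping test case id -> set of covered functions.
    Returns the test ids ordered so that each pick maximizes the number of
    newly covered functions; when no remaining test adds coverage, the
    covered set is reset and the process repeats (a common variant).
    """
    remaining = dict(coverage)
    covered = set()
    order = []
    while remaining:
        gains = {t: len(funcs - covered) for t, funcs in remaining.items()}
        best = max(gains, key=gains.get)
        if gains[best] == 0:
            if covered:
                covered = set()  # reset once all additional coverage is exhausted
                continue
            best = next(iter(remaining))  # tests adding nothing: arbitrary order
        order.append(best)
        covered |= remaining.pop(best)
    return order
```

The diff-coverage variant works analogously, but restricts each coverage set to the functions that differ between the old and new versions.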
Figure 6 displays three pairs of graphs, two per technique (one per test grouping strategy), with our
measure of rate of fault detection, APFD, on the y axes. Results for both programs appear similar under the
optimal technique, with a slow but consistent decrease in APFD as granularity level increased, independent
of grouping strategy. Having more test cases appears to provide greater opportunities for prioritization;
still, the differences are small. The coverage prioritization technique also displayed a decrease in APFD as
granularity level increased. The rate of decrease was greater for this technique than for the optimal technique,
and more obvious for bash than for emp-server. Similar tendencies can be observed for the diff-coverage
technique, which incorporates modification information. These results confirm the observation that lower
levels of granularity enable more effective prioritization than higher levels.
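APFD itself can be computed from the positions, in the prioritized order, of the first test case revealing each fault. A minimal sketch of the standard formulation (illustrative Python, not our measurement code; it assumes every fault is revealed by at least one test in the order):

```python
def apfd(order, detects):
    """APFD = 1 - (TF1 + ... + TFm) / (n * m) + 1 / (2n), where TFi is the
    1-based position of the first test case in `order` revealing fault i,
    n is the number of test cases, and m is the number of faults."""
    n, m = len(order), len(detects)
    position = {t: i + 1 for i, t in enumerate(order)}  # 1-based ranks
    first = [min(position[t] for t in tests if t in position)
             for tests in detects.values()]
    return 1.0 - sum(first) / (n * m) + 1.0 / (2 * n)
```

For example, if two faults are first revealed by the test cases in positions 1 and 3 of a five-test order, APFD = 1 - 4/10 + 1/10 = 0.7.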
For all techniques, APFD values for bash were lower than those for emp-server, and APFD values
for bash were more strongly affected by increases in granularity level than were results for emp-server.
This may be attributable to the somewhat more complex coverage characteristics of bash’s test suites: the
coverage achieved by individual test cases in that program’s suites (especially when grouped) varies less than
the coverage achieved by individual test cases for emp-server.
Finally, for the coverage and diff-coverage techniques, on bash and at higher levels of granularity, test
suites obtained using the random grouping strategy seem to have generated higher APFD values than test
suites obtained using the functional grouping strategy.
To help us determine whether the differences observed in the graphs are statistically significant, we
performed two ANOVAs, each considering two levels of the technique variable. The choice of technique
levels in each analysis was based on observations derived from exploratory analysis in which the optimal
technique served as a conservative estimate of a theoretical upper bound.
[Figure 6 consists of three pairs of graphs, one pair per technique (Optimal, Coverage, and Diff-Coverage), with one Random and one Functional grouping panel per pair; x axes show granularity level (1, 2, 4, 8, 16, 32, 64), y axes show APFD (0 to 100), with series for bash and emp-server.]
Figure 6: APFD values for test case prioritization.
The first ANOVA (Table 12) focuses on the optimal and the coverage based prioritization techniques. For
both programs, granularity level and technique had a significant effect on the value of APFD. This means
that increasing granularity level resulted in significantly different APFD values, and that APFD values can
change significantly based on whether optimal or coverage prioritization techniques are utilized. Grouping
strategy was also a significant factor for bash.
On both programs, however, the significant main effects are constrained by significant interactions
involving granularity level and technique. For example, Table 13 shows that for bash, under the
optimal prioritization technique, changes in granularity level did not have a significant impact on APFD
(all the means for optimal are in the same homogeneous group E), whereas for the coverage prioritization
technique, lower level granularities (under the homogeneous group E) generated significantly higher APFD
values than higher level granularities (e.g., granularity level G64 is under homogeneous group A). In addition,
Techniques: Optimal and Coverage
Variable: APFD.

                       Emp-server                      Bash
Source                 SS    DF   MS    F       p      SS    DF   MS    F       p
Granularity             398    6    66   11.60  0.00   9171    6  1529   42.37  0.00
Grouping                  0    1     0    0.04  0.84    153    1   153    4.25  0.04
Technique               806    1   806  141.07  0.00   7701    1  7701  213.46  0.00
Granularity*Grouping     73    6    12    2.13  0.05   1657    6   276    7.65  0.00
Granularity*Technique   180    6    30    5.26  0.00   5657    6   943   26.13  0.00
Grouping*Technique        0    1     0    0.06  0.80    144    1   144    3.99  0.05
Gran.*Group.*Tech.       74    6    12    2.15  0.05   1690    6   282    7.81  0.00
Error                  1280  224     6                 8082  224    36
Table 12: Optimal and Coverage Prioritization ANOVAs.
Bash
Source: Granularity * Technique
Dependent Variable: APFD

Granularity  Technique  Mean   Homogeneous Groups
64           Coverage   65.05  A
32           Coverage   80.72  B
16           Coverage   87.58  BC
 8           Coverage   90.50  CD
 4           Coverage   94.77  DE
64           Optimal    96.09  DE
 2           Coverage   97.41  DE
 1           Coverage   97.60  E
32           Optimal    97.65  E
16           Optimal    98.63  E
 8           Optimal    99.30  E
 4           Optimal    99.64  E
 2           Optimal    99.82  E
 1           Optimal    99.90  E
Table 13: Bash, granularity * technique, APFD.
for bash there were significant interactions between grouping strategy and granularity level, and granular-
ity level, grouping strategy, and technique, and these further constrain the implications of the main effect
results.
The second ANOVA (Table 14) involves the optimal and diff-coverage techniques. The results follow the
significance patterns observed in the previous analysis but with fewer interactions (none for emp-server and
two for bash), which places fewer constraints on the main effects findings. Granularity level and technique
were significant factors for both programs. Grouping strategy was a significant factor for bash, but the high
level of interaction between grouping and technique prompted us to analyze this further. Table 15 presents
the Bonferroni test results on the interaction between grouping and technique, indicating that grouping has
a significant effect only for the diff-coverage technique (random and functional grouping strategies under
optimal belong to the same homogeneous group C).
Techniques: Optimal and Diff-Coverage
Variable: APFD.

                       Emp-server                      Bash
Source                 SS    DF   MS    F       p      SS     DF   MS     F       p
Granularity             274    6    46    2.21  0.04    4728    6    788    4.88  0.00
Grouping                  2    1     2    0.09  0.76     788    1    788    4.88  0.03
Technique              2058    1  2058   99.63  0.00   20205    1  20205  125.19  0.00
Granularity*Grouping    160    6    27    1.29  0.26     607    6    101    0.63  0.71
Granularity*Technique   114    6    19    0.92  0.48    2485    6    414    2.57  0.02
Grouping*Technique        2    1     2    0.11  0.74     766    1    766    4.75  0.03
Gran.*Group.*Tech.      152    6    25    1.22  0.30     646    6    108    0.67  0.68
Error                  4626  224    21                 36153  224    161
Table 14: Optimal and Diff-Coverage Prioritization ANOVAs.
Bash
Source: Grouping * Technique
Dependent Variable: APFD

Grouping    Technique      Mean   Homogeneous Groups
Functional  Diff-Coverage  77.30  A
Random      Diff-Coverage  84.32  B
Functional  Optimal        98.69  C
Random      Optimal        98.74  C
Table 15: Bash, grouping * technique, APFD.
4 Discussion
We begin our discussion of results by summarizing the overall implications of the foregoing analyses for our
hypotheses. Tables 16 and 17 present summaries for emp-server and bash, respectively. The tables show,
for each analysis performed, for each source of variation and interaction considered, and for each dependent
variable of interest, whether that source of variation or interaction was statistically significant or not in
influencing that dependent variable. Asterisks denote significance and hyphens its absence. Blank entries
under the retest-all column are cases in which analyses did not apply (technique was not a source of variation
in these cases). The modified entity technique behaved the same as the retest-all technique and thus we
omit it from the tables.
With respect to hypothesis H1 (test suite granularity does not have a significant impact on the costs and
benefits of regression testing techniques), our results strongly support the alternative hypothesis. Test suite
granularity had a significant impact on the efficiency of regression testing (as measured by test execution
time) for retest-all, regression test selection, and test suite reduction methodologies: this result occurred in
all cases other than that of the modified entity technique, and was consistent across programs. Granularity
also significantly affected the rate of fault detection achieved by prioritization techniques; this result too was
consistent across programs. Finally, granularity did significantly impact the rate of fault detection achieved
through regression testing techniques, but this result occurred only for emp-server, under the modified
non-core entity RTS and GHS reduction techniques.
With respect to hypothesis H2 (test input grouping does not have a significant impact on the costs
and benefits of regression testing techniques), we are not able to unequivocally reject the null hypothesis.
               retest-all   selection                         reduction       prioritization
                            mod'd noncore    minimization     GHS             coverage     diff-coverage
                            vs retest-all    vs retest-all    vs retest-all   vs optimal   vs optimal
               exec  fde    exec  fde        exec  fde        exec  fde       apfd         apfd

granularity    *     -      *     *          *     -          *     *         *            *
grouping       -     -      -     -          -     -          -     *         -            *
technique                   *     *          *     *          *     *         *            *
gran*grp       -     -      -     -          -     -          -     *         -            *
gran*tech                   *     *          *     -          *     *         *            *
grp*tech                    -     -          -     -          *     *         -            -
gran*grp*tech               -     -          -     -          *     *         -            *
Table 16: Summary of significant effects for emp-server. Columns headed “exec” pertain to execution time, and columns headed “fde” pertain to fault detection effectiveness. “*” entries indicate cases in which the source of variation or interaction listed in column 1 was statistically significant, and “-” entries indicate cases where significance was not found.
               retest-all   selection                         reduction       prioritization
                            mod'd noncore    minimization     GHS             coverage     diff-coverage
                            vs retest-all    vs retest-all    vs retest-all   vs optimal   vs optimal
               exec  fde    exec  fde        exec  fde        exec  fde       apfd         apfd

granularity    *     -      *     -          *     -          *     -         *            *
grouping       -     -      -     -          -     -          -     -         -            *
technique                   *     *          *     *          *     *         *            *
gran*grp       -     -      -     -          -     -          -     -         -            -
gran*tech                   *     -          *     -          *     -         -            *
grp*tech                    -     -          -     -          -     -         -            *
gran*grp*tech               -     -          -     -          -     -         -            -
Table 17: Summary of significant effects for bash. Columns headed “exec” pertain to execution time, and columns headed “fde” pertain to fault detection effectiveness. “*” entries indicate cases in which the source of variation or interaction listed in column 1 was statistically significant, and “-” entries indicate cases where significance was not found.
Among the retest-all, regression test selection, and GHS reduction techniques, test input grouping exhibited
a significant effect in only one case: that of GHS reduction applied to emp-server. Test input grouping also
affected test case prioritization, but only for the diff-coverage technique.
With respect to hypothesis H3 (regression testing techniques do not perform significantly differently
in terms of the selected costs and benefits measures), results are consistent with those observed in earlier,
comparative studies of the techniques considered. For example, previous studies have shown that the modified
entity technique, applied at the function level, may achieve no savings [3], that test suite reduction can exhibit
varying degrees of fault detection effectiveness loss [38, 43], that tradeoffs exist between test execution time
and fault-detection effectiveness among non-safe RTS techniques such as those examined here [15], and
that the prioritization techniques [11] we examined here relate to one another in the way we have observed
here. In the context of this article, where our primary interest lies in observing the effects of test suite
granularity and test input grouping, this consistency with earlier results is important primarily because it
supports the conjecture that our results will generalize beyond the cases considered.
Finally, with respect to hypothesis H4 (test suite granularity and test input grouping effects across
regression testing techniques and programs do not significantly differ), we discovered frequent interaction
effects between technique and granularity, and our Bonferroni analyses further analyze these interactions.
In all but one case (coverage versus optimal prioritization on bash), a significant granularity effect was
accompanied by a significant granularity-technique interaction: the implication is that when granularity
has an effect, different regression testing techniques are affected differently than the control technique as
granularity level changes. Interactions with grouping are less frequent; in the few cases in which grouping
exhibited an impact, it also interacted with technique, and in one case (execution time for GHS reduction)
interaction effects were observed even though grouping alone had not had an impact.
Taken together, these results suggest that test suite granularity plays an important role in regression
testing cost-effectiveness – a role that merits attention by practitioners and further exploration by researchers.
The results further suggest that test input grouping may matter, but plays a less important role than
granularity. Moreover, the relative consistency of test suite granularity results across two quite different
test input groupings itself supports a conjecture that the test suite granularity results observed here may
generalize to other test input groupings.
Issues of the generality of these results can be more conclusively addressed only through replication of
these experiments on additional workloads. We can, however, suggest several further practical implications
of the results, and draw several additional observations on the data, as follows.
4.1 Implications for Common Practice (Retest-All)
The retest-all technique is arguably the most prevalently used regression testing technique in practice [30],
and is particularly appropriate when complete test suites can be executed, and their results validated, in
an amount of time considered reasonable by the testing organization (e.g., when fully automated test suites
have automated oracles and can run to completion overnight).
Our results show that the use of coarse granularity test suites can greatly increase efficiency for the
retest-all technique. For example, increasing granularity level from G1 to G4 on the emp-server test suite
saved an average of 365 minutes (a 72% reduction) in test execution time under retest-all. The same
granularity level increase on bash saved 415 minutes (a 53% reduction) in test execution time. Our results
also show that granularity need not adversely impact fault-detection effectiveness under retest-all: across all
our observations of the technique, only five cases occurred in which a fault detected at a lower granularity
level was not detected at a higher level when the entire test suite was executed.
The implication of these results, when coupled with their consistency across programs and test input
groupings, is that test engineers can safely harness granularity to increase the likelihood that they can afford
to use the retest-all technique.
This conclusion should, however, be qualified. The savings we observed in test suite execution time in our
experiments, using coarse granularity test suites, can be attributed primarily to reductions in the overhead
associated with test setup and cleanup. Our coarse granularity test suites apply just as many inputs as their
constituent test grains, but require less overhead in the number of setup and cleanup operations required.
Test suites in which test cases have lower overhead than these would be less conducive to providing practically
meaningful time savings through increases in granularity. For such suites other factors, such as the support
that fine granularity provides for prioritization or the greater simplicity of localizing faults uncovered by
small granularity test cases, may be of more value in establishing an appropriate granularity.
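The overhead effect can be illustrated with a simple cost model (the constants below are illustrative, not measurements from our study): total suite execution time is a per-input processing cost, unchanged by composition, plus a setup/cleanup overhead paid once per composed test case, which shrinks as grains are composed into coarser test cases.

```python
import math

def suite_time(n_inputs, granularity, per_input, overhead):
    """Sketch of suite execution time under composition: the
    input-processing term is fixed, but the setup/cleanup overhead is
    paid once per composed test case of `granularity` inputs each."""
    n_cases = math.ceil(n_inputs / granularity)
    return n_inputs * per_input + n_cases * overhead

# With illustrative unit costs, moving from granularity 1 to 4 cuts the
# overhead term to a quarter while the input-processing term is unchanged:
# suite_time(1000, 1, 1, 20) -> 21000
# suite_time(1000, 4, 1, 20) -> 6000
```

The model also makes the diminishing returns visible: each doubling of granularity halves only the (already smaller) overhead term.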
A second qualification concerns effects that may occur due to program complexity relative to input
size. The programs we have studied, like many programs, typically execute in time (roughly speaking) linear
in input size. For such programs, it is easy to envision why, in the case in which test cases incur some
startup costs, increased test suite granularity should lead to reduced execution time. Programs with higher
complexity relative to input size, however, such as programs that run in time quadratic in input size, may
display different relationships, because the increase in processing time incurred due to larger inputs may be
greater than the savings in execution time incurred due to coarser granularity.
It is also worth noting that, in our studies, the efficiency gains achieved by increasing granularity level
were greatest when starting from low granularities. For example, increasing granularity level from G1 to
G4 on emp-server saved 365 minutes in test execution time, but doubling granularity level further to G8
saved only 80 additional minutes, and doubling it again to G16 saved only 20 minutes more. As granularity
level increases, the returns achieved from further increases diminish, and this may allow factors other than
granularity to take on greater practical importance, above a certain granularity level, than further increases
in level of granularity. The results of our experiments, therefore, should not be interpreted to imply that the
most cost-effective granularity level in practice for a given test suite T is |T|.

A final qualification concerns the effects of test oracle accuracy. In practice, regression testing oracles
range from those that exhaustively compare all components of program output to those that simply check
subsets of system state at specific checkpoints (as is often done, for example, by JUnit test cases through
embedded assertions). Intuitively, we expect this range of oracle rigour to contribute to both the fault
detection effectiveness of test suites, and the expense of executing those suites.
In our initial study of the effects of test suite granularity [33], when considering the retest-all technique
applied to a subset of the versions of bash and emp-server considered here, we found (in contrast to our
findings here) that coarser granularity test suites detected faults more effectively than finer granularity suites.
We surmised that these gains in fault detection effectiveness might be partially attributed to the execution,
by test cases in coarse granularity test suites, of additional code that causes data state changes occurring in
earlier stages of execution to be visible. We reasoned that fine granularity test suites could be more effective
if they were equipped with better oracles.
For these experiments, we improved the oracles used to validate the results of emp-server and bash test
cases; the improved techniques analyze additional output data beyond, and perform more precise differencing
of data than, the techniques used in the previous experiments. This resulted in an increase in the fault
detection effectiveness of our test cases and test suites to the levels observed here, where the fault detection
effectiveness of test suites under the retest-all technique was not significantly affected by test suite granularity.
These observations yield an interesting conclusion about test oracle design as it relates to test suite
granularity. Inaccurate oracles may be a greater liability than accurate ones for fault detection effectiveness
at low levels of granularity than at high levels. At high granularity levels, test cases apply more inputs and
have greater opportunities to detect discrepancies than at low levels, compensating for oracle limitations.
Together the results of our experiments suggest that, if practitioners equip their test cases with the right
observers, they should be less likely to perceive differences in fault detection effectiveness as granularity
increases than when they employ weaker oracles.
4.2 Implications for Regression Test Selection
An advantage of the retest-all technique is that it does not discard test cases that could reveal faults, and this
advantage was illustrated in our experiments. Nevertheless, the retest-all technique is not always a viable or
cost-effective option. Re-execution of the entire test suite may require more time than an organization can
spare, or require large amounts of expensive human effort (e.g., when validation is not automated) that could
be better spent on other tasks. In such cases, engineers may use regression test selection (RTS) techniques
to choose the test cases that are important for use in validating a particular system release.
Safe RTS techniques are guaranteed (under specific conditions) to not omit, in their selection, test cases
that can reveal faults, and have been shown to reduce regression testing time [7, 15, 34, 37]. In our particular
experiments, however, for the particular test suites and program versions we utilized, the (safe) modified
entity technique always selected all test cases, and provided no savings. We thus focus on the implications
of our results for non-safe RTS techniques.
Non-safe RTS techniques provide a wide range of efficiency/effectiveness tradeoffs, balancing the ben-
efits gained in test execution costs against the risks involved in losing fault detection effectiveness. Our
experiments considered both an aggressively selective technique (minimization RTS), and a less aggressively
selective technique (modified non-core entity RTS), and our results have several implications for each of these
approaches.
Considering test suite granularity first, and beginning with the less aggressive modified non-core entity
technique, finer granularity test suites are clearly more supportive of modified non-core entity regression test
selection than coarser granularity suites, since the ability of the technique to decrease regression testing time
decreased rapidly as granularity level increased. This tendency was evident for both programs. For example,
when the modified non-core entity technique was applied to the level G1 test suite for emp-server, that test
suite’s average execution time was reduced from 505 to 180 minutes (a 64% time reduction). When the same
technique was applied to the level G64 suite for emp-server, the average saving was less than 9%. Finer
granularity provides greater flexibility than coarse granularity, by promoting larger numbers of small test
cases that can be successfully manipulated by the technique to reduce the cost of test execution.
Our results also show that, even when organizations can afford to employ a retest-all technique using
test suites composed at some granularity level Gk, they may be able to save regression testing time by using
some lower level of granularity together with the modified non-core entity RTS technique, and gain any
other benefits (e.g., fault localization ease or prioritizability) that accrue for finer granularity test cases. For
example, on bash, modified non-core entity RTS applied to level G1 test suites resulted in more efficient
re-testing than retest-all applied to level G2 or G4 test suites. Whether such efficiency gains are worthwhile
depends, however, on whether the fault detection effectiveness loss that can accompany modified non-core
entity is acceptable to the testing organization.
When we turn to aggressively selective RTS techniques, represented here by minimization RTS, we find
much greater potential for savings, with much greater potential for fault detection loss. Here, however, the
effects of granularity are somewhat mixed: on one of our programs (emp-server), higher granularity levels
produced savings in test execution time, and on the other (bash) they reduced savings, and granularity level
had no effect on fault detection effectiveness. We suspect that the very aggressiveness of the minimization
RTS technique, which leads to relatively large test suite reductions, may cause granularity effects to assume
less influence on cost-effectiveness than other factors.
Considering test input grouping, we were unable to reject the null hypothesis (H2) for non-safe RTS
techniques for execution time or fault detection effectiveness loss. This suggests that our results with respect
to test suite granularity are to some degree robust over the test suite characteristics captured by our test
input grouping construct. An implication is that future empirical work in this area could focus, without loss
of internal validity, on studying the effects of test suite granularity and technique.
Finally, our data suggests that fault difficulty can influence granularity effects. As noted earlier, most
faults in bash were relatively difficult to expose, with 99% revealed by fewer than 1% of that program’s
level G1 test cases. In contrast, only 23% of the faults in emp-server were exposed by fewer than 1% of
that program’s level G1 test cases. This difference can be seen as responsible for two effects identified in
Section 3.6.2. First, with the modified non-core entity RTS technique, granularity significantly affected fault
detection effectiveness on emp-server, with higher granularity levels typically increasing effectiveness; but
this result did not occur for bash. Second, with the minimization RTS technique, fault detection effectiveness
varied more greatly across granularity levels for bash than for emp-server.
With respect to the increases in fault detection effectiveness that accompany granularity level increases
for emp-server, we expect that the “observer effect” previously mentioned is at least partially responsible
for these results. Test cases in coarser granularity test suites have somewhat greater fault detection abilities
than their counterparts in finer granularity suites due to increased opportunities for state and output changes
to be revealed. But these results suggest that fault difficulty plays a role in this effect.
With respect to the variance in fault detection effectiveness seen for bash, however, a different factor
emerges. Test cases that expose faults singly can fail to do so when composed with other test cases due to
interactions. When faults are detected by only a few test inputs in a fine granularity test suite, relatively
few interactions need occur in a coarse granularity suite composed of those test inputs to cause those faults
to go undetected there. When subsets of test suites are selected, the likelihood that difficult-to-detect faults
will go undetected in coarse granularity test suites increases further, because there are fewer opportunities
for including test cases in which fault-masking interactions do not occur. (This is further evident in the fact
that, for the retest-all technique, the effects of fault difficulty did not influence fault-detection effectiveness.)
If these results generalize, an implication is that when practitioners utilize coarse granularity test suites,
they may expect that these suites will be relatively strong (compared to fine granularity suites) at revealing
relatively easy-to-detect faults, but relatively weak at revealing difficult-to-detect faults.
4.3 Implications for Test Suite Reduction
Test suite reduction and minimization RTS each seek test suite subsets that provide minimal coverage of
specific program entities (e.g. functions); they differ, however, in that reduction seeks minimal coverage of
all covered program entities, while minimization RTS seeks minimal coverage of covered modified entities.
The primary effect of this difference, in our study, is that test suite reduction results in larger test suites
than minimization RTS. This size difference results in greater fault detection effectiveness, and greater test
execution times, for test suites reduced by GHS reduction than for those selected by minimization RTS. In
cases where aggressive reduction in testing effort is needed, GHS reduction may be a more cost-effective
alternative than minimization RTS.
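Both approaches can be viewed as greedy set-cover problems that differ only in the set of requirements to be covered. The following simplified sketch illustrates the distinction (the actual GHS heuristic is more refined, processing requirements in order of increasing cardinality; identifiers here are illustrative):

```python
def greedy_reduce(coverage, required):
    """Greedy set-cover sketch: select test cases until every entity in
    `required` is covered.

    coverage: dict mapping test id -> set of covered entities.
    required: all entities covered by the suite (test suite reduction),
              or only the covered, modified entities (minimization RTS).
    """
    reduced = []
    uncovered = set(required)
    while uncovered:
        best = max(coverage, key=lambda t: len(coverage[t] & uncovered))
        if not coverage[best] & uncovered:
            break  # remaining requirements are not coverable by any test
        reduced.append(best)
        uncovered -= coverage[best]
    return reduced
```

Because `required` is larger under reduction than under minimization RTS, the selected subset is typically larger, which is consistent with the size difference just noted.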
Where test suite granularity is concerned, GHS reduction shares most of the effects and implications seen
for minimization RTS. The effects of granularity on test execution time are somewhat mixed across programs,
with coarse granularity adversely impacting execution time for bash, but improving it for emp-server.
Fault detection effectiveness loss is greater for bash, with its relatively difficult-to-detect faults, than for
emp-server, but variance in detection is greater for bash. Thus, we cannot yet provide to practitioners,
based on our data, clear evidence that any particular choice of granularity level is generally more cost-effective
than other choices for reduction.
The one effect observed for GHS reduction that was not shared with minimization RTS involves the effects
of test input grouping observed on emp-server. On this program, test input grouping was a significant
factor for fault detection effectiveness; in particular, functional grouping yielded better and more consistent
fault detection effectiveness, for reduced test suites, than random grouping. This suggests that in at least
some cases, functionally grouped test suites may be preferable to randomly grouped suites for practitioners
who anticipate applying test suite reduction. Because this effect did not occur on bash, we conjecture that it may not hold
with respect to relatively hard-to-detect faults. On the other hand, functional grouping did not adversely
affect results on bash, either, so its use may not carry risk.
4.4 Implications for Test Case Prioritization
Whereas the retest-all, regression test selection, and test suite reduction methodologies are essentially mu-
tually exclusive, test case prioritization can be applied in conjunction with these methodologies to order all
test cases, selected test cases, or reduced test suites. This has implications for our prioritization results.
For example, our results show that finer test suite granularity is likely to provide greater opportunities for
prioritization and support higher APFD values than coarser granularity. This occurs because when coarse
granularity test cases are decomposed into finer granularity ones, the scope of the effects of the average test
case (e.g., its coverage, or its relationship with changed code) decreases, allowing prioritization techniques
to more precisely discriminate between test cases. This provides additional impetus to engineers using the
retest-all technique to choose a middle ground in granularity if they care about rate of fault detection. It
also provides an additional argument for engineers using RTS techniques to use fine granularity test cases.
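The discrimination argument above can be made concrete with a sketch of coverage-based prioritization in its two common forms, total coverage and additional coverage (an illustrative sketch; names are ours, and ties are broken arbitrarily):

```python
def prioritize_total_coverage(coverage):
    """Order test cases by total number of entities covered, descending.
    `coverage` maps test case id -> set of covered entities."""
    return sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)

def prioritize_additional_coverage(coverage):
    """Repeatedly pick the test case adding the most not-yet-covered entities."""
    remaining = dict(coverage)
    covered, order = set(), []
    while remaining:
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        order.append(best)
        covered |= remaining.pop(best)
    return order
```

With coarse test cases, many entries in `coverage` are large and nearly identical, so these orderings carry little information; decomposing them into finer test cases with smaller, more distinct coverage sets lets the two strategies diverge usefully.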
In particular, engineers employing safe RTS techniques may need some method for responding to cases
in which their techniques fail to reduce test suite size, as occurred for the modified entity technique in our
studies. One response involves falling back on prioritization as a technique for placing important test cases
early, facilitating faster detection of (and response to) faults. Fine granularity test suites facilitate this.
It is important to note, however, that these implications also vary with the difficulty of detecting the
faults that exist in the program under test. Programs for which the number of fault-exposing test cases is
large are less likely to suffer APFD losses from increases in granularity than programs for which the number
of fault-exposing test cases is small. This result was most evident in our data when considering the APFD
results for the coverage prioritization technique under the functional test input grouping. In this case, the
APFD for bash was reduced by 37 points as granularity level increased from G1 to G64, whereas the APFD
for emp-server (which had fewer difficult-to-detect faults) was reduced by only 6 points.
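To make the APFD comparisons above concrete, the metric can be computed directly from an ordering and a fault-detection matrix (a sketch following the standard APFD definition from our earlier prioritization work; it assumes every fault is detected by at least one test case in the suite):

```python
def apfd(order, detects):
    """Average Percentage of Faults Detected for a test case ordering.

    order   : list of test case ids, in execution order (length n)
    detects : dict mapping fault id -> set of test ids revealing it (m faults)
    APFD = 1 - (TF_1 + ... + TF_m)/(n*m) + 1/(2n), where TF_i is the
    1-based position of the first test case revealing fault i.
    """
    n, m = len(order), len(detects)
    tf_sum = 0
    for tests in detects.values():
        # position of the first test in `order` that reveals this fault
        tf_sum += next(pos for pos, t in enumerate(order, start=1) if t in tests)
    return 1 - tf_sum / (n * m) + 1 / (2 * n)
```

For instance, with four test cases where t1 reveals one fault and t3 the other, the order [t1, t2, t3, t4] yields an APFD of 0.625, while the reversed order yields 0.375: the earlier faults are revealed, the higher the score.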
One further implication of this consideration pertains to testing processes, which are typically driven by
tradeoffs between the expense of testing and the desire to detect faults. Where rate of fault detection is
concerned, when running test cases during development (especially as in test-driven development processes,
or test-every-night processes) where initial, easier-to-find faults might be expected to be common, coarse-
grained test cases that run faster due to lower setup time requirements will be most cost-effective. When
running system tests at the end of development cycles, where the probabilities of individual test cases failing
are smaller and the testing interval may be somewhat longer, fine-grained test cases will be most cost-
effective. Test suite designers might do well, therefore, to build flexibility into their test suites, such that
the granularity of those suites can be adjusted to meet the needs of particular testing stages.
Finally, where our prioritization results are concerned, in all but one case considered (coverage versus
optimal on bash), significance in granularity was accompanied by significance in granularity-technique in-
teraction: when granularity has an effect, different techniques are affected differently as granularity level
changes. For practitioners, the implication is that, in judging the relative effectiveness of techniques,
it is not sufficient to consider just the results of those techniques; granularity must also be considered. For
researchers, the implication is that, when experimenting with techniques, it is important to specify
the workload (test suite characteristics) being utilized.
4.5 The Effects of Granularity and Grouping on Fault Detection per Test Case
In the preceding analyses and discussion we focused on a measure of fault-detection effectiveness relative to
test suites, or to reduced or selected subsets of test suites. Under this measure, for the retest-all methodology
and using our improved oracle and failure detection tools, our test suites did not lose significant fault-
detection effectiveness as granularity increased or decreased. Reduced or selected subsets of test suites,
however, did lose fault-detection effectiveness, and did exhibit fault-detection effectiveness that varied at
different test suite granularities.
To investigate the cause of this difference, we look more closely at our data, turning our attention away
from entire test suites or test suite subsets, and toward the fault-detection effectiveness of the individual
test cases that compose these suites and subsets. To do this, we consider each of our coarse-grained test
cases at each granularity level Gk, and investigate the fault-detection effectiveness of these test cases singly,
versus the fault-detection effectiveness of their constituent level G1 test cases. On this view, with respect to
a specific test case tGk at level Gk (k > 1), and its constituent set S(tGk), the set of all granularity level G1
test cases t1, t2, . . . , tk, used to construct it, and with respect to a particular fault f , four categories of test
cases exist:
1. Equal-omission: tGk fails to detect f , and each test case ti ∈ S(tGk) fails to detect f .
2. Detection-lost: tGk fails to detect f , but there exists at least one test case ti ∈ S(tGk) that detects f .
3. Detection-gained: tGk detects f , even though each test case ti ∈ S(tGk) fails to detect f .
4. Equal-detection: tGk detects f , and there exists at least one test case ti ∈ S(tGk) that detects f .
These categories can help us track whether the process of composing test cases into coarser test cases
causes gains or losses in fault-detection effectiveness at the level of individual Gk test cases, as opposed to
the level of entire test suites composed of Gk test cases.
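The four categories above partition all (test case, fault) outcomes, and the classification can be sketched directly (an illustrative helper, with names of our own choosing):

```python
def classify(composed_detects, constituents_detect):
    """Classify a composed G_k test case, for one fault, into the four
    categories defined above.

    composed_detects    : True if the composed G_k test case detects the fault
    constituents_detect : booleans, one per constituent G_1 test case
    """
    any_constituent = any(constituents_detect)
    if composed_detects and any_constituent:
        return "equal-detection"
    if composed_detects:
        return "detection-gained"   # composed detects; no constituent does
    if any_constituent:
        return "detection-lost"     # some constituent detects; composed does not
    return "equal-omission"         # neither composed nor constituents detect
```

Detection-lost cases arise, for example, when an earlier constituent's execution masks the failure behavior of a later one; detection-gained cases arise when constituents interact to expose a fault none reveals alone.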
Figure 7 uses a stacked-bar chart to depict the percentages of test cases in our test suites that fall into
each of the foregoing categories, for granularity levels G2 through G64. The chart on the left corresponds
to random groupings, and the chart on the right corresponds to functional groupings. In each chart, the
horizontal axis represents test suite granularity, and the vertical axis, scaled 0 through 100%, represents
percentages of the total number of test cases in a test suite. A pair of bars are shown together at each
granularity level; the first corresponds to bash and the second to emp-server. Each bar is a composite,
with constituent bars stacked over one another, representing the equal-omission, detection-lost, detection-
gained, and equal-detection categories, from top to bottom, respectively. The percentage of test cases in each
category, under each granularity level and grouping strategy, is averaged across all the faults and versions of
each program.
Consider the results for level G2 under the random test input grouping (the two leftmost bars in the
leftmost chart). The first bar corresponds to bash and shows that more than 99% of the G2 test cases for
bash were classified as equal-omission: the constituent test cases did not detect faults, and composing them
caused no change in fault detection. Although not discernible in the graph, an average of only 0.33% of
the G2 test cases for bash were classified as equal-detection, detecting one or more faults also detected by
constituents. No detection-lost test cases were found, and only 0.12% of the test cases were classified as
detection-gained. Results for emp-server at level G2 are similar, except that almost 9% of the G2 test cases
for emp-server were classified as equal-detection (reflecting the fact that emp-server had more relatively
easy-to-detect faults than bash).
Continuing in this manner of observation across granularity levels, we observe that for bash, the percent-
age of equal-detection test cases increases consistently as granularity level increases. We also observe that
the percentage of detection-gained test cases for bash increases with granularity level from approximately
1% at level G16 to about 3% at level G64. Detection-lost cases are noticeable only at level G64, where on
average less than one test case masks a fault. For emp-server, however, detection-lost cases outnumber
detection-gained cases, especially at levels G16, G32, and G64.
[Two stacked-bar charts, one for random grouping (left) and one for functional grouping (right), each plotting the percentage of test cases in the equal-omission, detection-lost, detection-gained, and equal-detection categories at granularity levels G2 through G64, with paired bars for bash and empire-server, as described above.]

Figure 7: Fault-detection effectiveness effects at the individual test case level.
In the case of functional groupings (the rightmost chart in Figure 7) we observe a difference in the
percentage of equal-omission test cases when compared with random grouping. This is clearly noticeable
for emp-server from levels G2 through G16. In other words, although the test suite’s overall effectiveness
remained the same across groupings, fewer test cases revealed faults. Bash under functional grouping exhibits
a slightly larger percentage of detection-lost test cases at levels G32 and G64 than under random grouping.
Correspondingly, the percentage of detection-gained test cases at these levels for bash is smaller for functional
grouping than for random grouping. Emp-server exhibits the opposite trend, with slightly fewer detection-
lost test cases for functional grouping than for random grouping at the G32 and G64 granularity levels.
Overall, increases in granularity level are associated with increases in the percentages of both detection-
lost and detection-gained test cases. Further, at lower granularities, functional grouping test suites have a
smaller percentage of equal-detection test cases.
The significance of this discussion lies partly in its ability to help explain our fault-detection effectiveness
results for retest-all, regression test selection, and test suite reduction. The test cases in our test suites
are collectively strong enough to reveal all of the faults in our programs. When test inputs are composed
into coarser granularity test cases, the fault-revealing capabilities of many individual test cases change, but
situations in which detection power is lost by some test cases are compensated for by other test cases.
This held true both for emp-server with its somewhat more frequently detected faults, and for bash with
its somewhat less frequently detected faults, for the retest-all technique. For that technique (or for safe
RTS techniques in general), fault-detection effectiveness at the test suite level is what matters, and in our
experiments, granularity effects did not significantly affect fault-detection effectiveness at that level.
When considering regression testing techniques that select from among test cases (regression test selection
and test suite reduction), or test case prioritization techniques that evaluate results relative to individual test
cases, the situation changes. Here, the potential for test case granularity to alter fault-detection-effectiveness
at the individual test case level takes on greater importance, because as the number of test cases composing a
test suite is reduced, the importance of individual test cases relative to the entire suite increases. This factor
likely contributes to the cases in which our selective methodologies and test case prioritization techniques
exhibit significant effects in fault detection effectiveness as test suite granularity varies.
5 Conclusion
Writers of testing textbooks have long shown awareness that the composition of test suites can affect the
cost-effectiveness of testing. These effects can begin when testing the initial release of a system, where success
in finding faults in that release, as well as the amount of testing that can be accomplished, can vary based on
test suite granularity and test input grouping. Software that succeeds, however, subsequently evolves: the
costs of testing that software are compounded over its lifecycle, and the opportunity to miss faults through
inadequate regression testing occurs with each new release. It is thus imperative that researchers study the
effects of test suite design across the entire software lifecycle.
Several test suite design factors, such as test suite size and adequacy criteria, have been empirically
studied, but few have been studied with respect to evolving software. Several regression testing methodologies
have been empirically studied, but few with respect to issues in test suite design. This article brings the
empirical study of test suite design and regression testing methodologies together, focusing on two particular
design factors: test suite granularity and test input grouping. Our results highlight several cost-benefits
tradeoffs associated with these factors, and related to regression testing techniques and processes.
Empirical studies such as those that we have described here can provide evidence for or against hypotheses
such as those we have investigated, but cannot prove them. Instead, validity concerns must be addressed
by additional studies using different programs and other artifacts, alternative measures, and alternative
methodologies. Only through such repetition can a body of evidence be built on behalf of such hypotheses,
rendering results more general. This work lays the groundwork for such further studies.
ACKNOWLEDGEMENTS
This work was supported by the NSF Information Technology Research program under Awards CCR-0080898
and CCR-0080900 to University of Nebraska, Lincoln and Oregon State University, and by NSF Awards CCR-
9703108 and CCR-9707792 to Oregon State University. We thank Satya Kanduri and Srikanth Karre for
helping prepare the emp-server and bash subjects.
Appendix A: Additional Analyses of Significant Interactions
A.1 Regression Test Selection
Emp-server
Source: Granularity * Technique
Dependent Variable: Test Execution Time

Granularity  Technique                  Mean    Homogeneous Groups
64           modified non-core entity    24.11  A
64           retest-all                  25.65  A
32           modified non-core entity    32.17  A B
32           retest-all                  34.24  A B
16           modified non-core entity    41.82  A B
16           retest-all                  49.29  A B
8            modified non-core entity    55.04  A B
4            modified non-core entity    80.21  B C
8            retest-all                  80.47  B C
2            modified non-core entity   120.37  C D
4            retest-all                 139.84  D E
1            modified non-core entity   180.06  E
2            retest-all                 258.41  F
1            retest-all                 505.15  G
Table 18: Emp-server, granularity * technique, test execution time, modified non-core entity and retest-all.
Emp-server
Source: Granularity * Technique
Dependent Variable: Test Execution Time

Granularity  Technique     Mean    Homogeneous Groups
16           minimization    6.37  A
32           minimization    6.70  A
64           minimization    6.91  A
8            minimization    7.52  A
4            minimization    8.74  A
2            minimization    8.97  A
1            minimization    9.62  A
64           retest-all     25.65  B
32           retest-all     34.24  B
16           retest-all     49.29  C
8            retest-all     80.47  D
4            retest-all    139.84  E
2            retest-all    258.41  F
1            retest-all    505.15  G
Table 19: Emp-Server, granularity * technique, test execution time, minimization and retest-all.
Bash
Source: Granularity * Technique
Dependent Variable: Test Execution Time

Granularity  Technique     Mean    Homogeneous Groups
2            minimization   16.33  A
1            minimization   18.44  A
4            minimization   26.00  A
16           minimization   39.56  A
8            minimization   43.17  A B
32           minimization   69.44  A B C
64           minimization   72.22  A B C D
64           retest-all    212.78  B C D E
32           retest-all    230.67  C D E
16           retest-all    240.83  D E
8            retest-all    300.33  E
4            retest-all    366.44  E F
2            retest-all    520.06  F
1            retest-all    782.22  G
Table 20: Bash, granularity * technique, test execution time, minimization and retest-all.
A.2 Test Suite Reduction
Emp-server
Source: Granularity * Grouping
Dependent Variable: Fault Detection Effectiveness

Granularity  Grouping    Mean   Homogeneous Groups
8            Random       9.06  A
2            Random       9.28  A B
1            Random       9.28  A B
1            Functional   9.28  A B
64           Random       9.56  A B C
32           Functional   9.56  A B C
16           Random       9.72  A B C
64           Functional   9.72  A B C
2            Functional   9.83  B C
32           Random       9.89  B C
8            Functional   9.94  B C
4            Functional  10.00  C
4            Random      10.00  C
16           Functional  10.00  C
Table 21: Emp-server, granularity * grouping, fault detection effectiveness.
Emp-server
Source: Granularity * Technique
Dependent Variable: Fault Detection Effectiveness

Granularity  Technique      Mean   Homogeneous Groups
1            GHS reduction   8.56  A
8            GHS reduction   9.00  A B
2            GHS reduction   9.11  A B C
64           GHS reduction   9.44  B C D
32           GHS reduction   9.56  B C D
16           GHS reduction   9.72  C D
64           retest-all      9.83  D
32           retest-all      9.89  D
16           retest-all     10.00  D
2            retest-all     10.00  D
4            retest-all     10.00  D
4            GHS reduction  10.00  D
1            retest-all     10.00  D
8            retest-all     10.00  D
Table 22: Emp-server, granularity * technique, fault detection effectiveness, GHS reduction and retest-all.
Emp-server
Source: Granularity * Grouping * Technique
Dependent Variable: Fault Detection Effectiveness

Granularity  Grouping    Technique      Mean   Homogeneous Groups
8            Random      GHS reduction   8.11  A
2            Random      GHS reduction   8.56  A B
1            Random      GHS reduction   8.56  A B
1            Functional  GHS reduction   8.56  A B
32           Functional  GHS reduction   9.33  B C
64           Functional  GHS reduction   9.44  B C
64           Random      GHS reduction   9.44  B C
16           Random      GHS reduction   9.44  B C
64           Random      retest-all      9.67  C
2            Functional  GHS reduction   9.67  C
32           Functional  retest-all      9.78  C
32           Random      GHS reduction   9.78  C
8            Functional  GHS reduction   9.89  C
4            Random      retest-all     10.00  C
4            Functional  retest-all     10.00  C
8            Functional  retest-all     10.00  C
4            Functional  GHS reduction  10.00  C
16           Random      retest-all     10.00  C
16           Functional  GHS reduction  10.00  C
16           Functional  retest-all     10.00  C
8            Random      retest-all     10.00  C
32           Random      retest-all     10.00  C
4            Random      GHS reduction  10.00  C
2            Functional  retest-all     10.00  C
2            Random      retest-all     10.00  C
1            Functional  retest-all     10.00  C
1            Random      retest-all     10.00  C
64           Functional  retest-all     10.00  C
Table 23: Emp-server, granularity * grouping * technique, fault detection effectiveness, GHS reduction andretest-all.
Emp-server
Source: Granularity * Technique
Dependent Variable: Test Execution Time

Granularity  Technique      Mean    Homogeneous Groups
64           GHS reduction   16.17  A
32           GHS reduction   17.53  A B
16           GHS reduction   18.24  A B
8            GHS reduction   23.20  A B C
64           retest-all      25.65  B C
4            GHS reduction   30.06  C D
32           retest-all      34.24  D
2            GHS reduction   36.97  D
1            GHS reduction   48.55  E
16           retest-all      49.29  E
8            retest-all      80.47  F
4            retest-all     139.84  G
2            retest-all     258.41  H
1            retest-all     505.15  I
Table 24: Emp-server, granularity * technique, test execution time, GHS reduction and retest-all.
Emp-server
Source: Granularity * Grouping * Technique
Dependent Variable: Test Execution Time

Granularity  Grouping    Technique      Mean    Homogeneous Groups
64           Random      GHS reduction   14.89  A
32           Functional  GHS reduction   15.94  A
64           Functional  GHS reduction   17.45  A
16           Functional  GHS reduction   18.08  A
16           Random      GHS reduction   18.41  A
32           Random      GHS reduction   19.11  A
8            Random      GHS reduction   20.44  A B
4            Random      GHS reduction   24.68  A B C
64           Random      retest-all      25.62  A B C
64           Functional  retest-all      25.68  A B C
8            Functional  GHS reduction   25.96  A B C
2            Random      GHS reduction   32.60  B C D
32           Functional  retest-all      33.92  C D
32           Random      retest-all      34.56  C D
4            Functional  GHS reduction   35.44  C D
2            Functional  GHS reduction   41.33  D E
16           Functional  retest-all      48.53  E
1            Random      GHS reduction   48.55  E
1            Functional  GHS reduction   48.55  E
16           Random      retest-all      50.06  E
8            Functional  retest-all      80.22  F
8            Random      retest-all      80.72  F
4            Functional  retest-all     137.22  H
4            Random      retest-all     142.46  H
2            Functional  retest-all     253.53  I
2            Random      retest-all     263.30  I
1            Random      retest-all     505.15  J
1            Functional  retest-all     505.15  J
Table 25: Emp-server, granularity * grouping * technique, test execution time, GHS reduction and retest-all.
Bash
Source: Granularity * Technique
Dependent Variable: Test Execution Time

Granularity  Technique       Mean    Homogeneous Groups
1            GHS reduction    68.44  A
2            GHS reduction    97.67  A
16           GHS reduction   131.56  A B
4            GHS reduction   141.44  A B
8            GHS reduction   160.33  A B
32           GHS reduction   199.94  A B C
64           GHS reduction   201.22  A B C
64           retest-all      212.78  A B C
32           retest-all      230.67  A B C
16           retest-all      240.83  A B C
8            retest-all      300.33  B C
4            retest-all      366.44  C D
2            retest-all      520.06  D
1            retest-all      782.22  E
Table 26: Bash, granularity * technique, test execution time, GHS reduction and retest-all.
A.3 Test Case Prioritization
Emp-server
Source: Granularity * Technique
Dependent Variable: APFD

Granularity  Technique  Mean   Homogeneous Groups
64           Coverage   92.22  A
8            Coverage   94.79  AB
16           Coverage   94.93  AB
32           Coverage   96.12  ABC
1            Coverage   96.70  BCD
4            Coverage   97.87  CDE
64           Optimal    98.41  CDE
2            Coverage   99.04  DE
32           Optimal    99.12  DE
16           Optimal    99.60  E
8            Optimal    99.79  E
4            Optimal    99.89  E
2            Optimal    99.94  E
1            Optimal    99.97  E
Table 27: Emp-server, granularity * technique, APFD, Optimal and Coverage.
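The APFD (Average Percentage of Faults Detected) values in these tables reward orderings that expose faults early in a test run. As a minimal sketch (the function name and the dict-based fault matrix are our own illustration), the standard formula APFD = 1 - (TF_1 + ... + TF_m)/(n*m) + 1/(2n) can be computed as follows:

```python
def apfd(order, fault_matrix):
    """Average Percentage of Faults Detected for one test-case ordering.

    order:        list of test ids, first-to-run first (n tests).
    fault_matrix: dict mapping test id -> set of faults it exposes
                  (m faults in total across all tests).
    TF_i is the 1-based position in `order` of the first test that
    exposes fault i.
    """
    n = len(order)
    faults = set().union(*fault_matrix.values())
    m = len(faults)
    first_detect = {}
    for pos, t in enumerate(order, start=1):
        for f in fault_matrix.get(t, ()):
            first_detect.setdefault(f, pos)
    assert len(first_detect) == m, "every fault must be exposed by some test"
    return 1 - sum(first_detect.values()) / (n * m) + 1 / (2 * n)
```

For instance, with two tests where t2 exposes both faults and t1 exposes only one, running t2 first yields APFD 0.75 versus 0.5 for the reverse order; the Optimal rows above correspond to orderings chosen to maximize this value.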
Bash
Source: Grouping * Technique
Dependent Variable: APFD

Grouping    Technique   Mean   Homogeneous Groups
Functional  Additional  86.13  A
Random      Additional  89.20  B
Functional  Optimal     98.69  C
Random      Optimal     98.74  C
Table 28: Bash, grouping * technique, APFD, Optimal and Additional.
Emp-server
Source: Granularity * Grouping * Technique
Dependent Variable: APFD

Granularity  Grouping    Technique  Mean   Homogeneous Groups
64           Functional  Coverage   90.59  A
8            Random      Coverage   92.90  AB
64           Random      Coverage   93.86  ABC
16           Functional  Coverage   94.45  ABCD
16           Random      Coverage   95.40  BCDE
1            Random      Coverage   95.48  BCDEF
32           Functional  Coverage   95.90  BCDEFG
32           Random      Coverage   96.33  BCDEFG
8            Functional  Coverage   96.68  BCDEFG
4            Random      Coverage   97.79  CDEFG
1            Functional  Coverage   97.91  CDEFG
4            Functional  Coverage   97.95  CDEFG
64           Random      Optimal    98.38  DEFG
64           Functional  Optimal    98.43  DEFG
2            Functional  Coverage   98.65  DEFG
32           Functional  Optimal    99.04  EFG
32           Random      Optimal    99.20  EFG
2            Random      Coverage   99.43  EFG
16           Functional  Optimal    99.60  EFG
16           Random      Optimal    99.60  EFG
8            Random      Optimal    99.79  FG
8            Functional  Optimal    99.79  FG
4            Random      Optimal    99.89  G
4            Functional  Optimal    99.89  G
2            Functional  Optimal    99.94  G
2            Random      Optimal    99.94  G
1            Functional  Optimal    99.97  G
1            Random      Optimal    99.97  G
Table 29: Emp-server, granularity * grouping * technique, APFD, Optimal and Coverage.
Bash
Source: Granularity * Grouping
Dependent Variable: APFD

Granularity  Grouping    Mean   Homogeneous Groups
64           Functional  77.88  A
32           Functional  83.10  A
64           Random      83.26  A
16           Random      91.76  B
8            Random      92.88  BC
16           Functional  94.44  BC
32           Random      95.28  BC
8            Functional  96.92  BC
4            Random      97.00  BC
4            Functional  97.41  BC
2            Functional  98.54  BC
1            Functional  98.57  BC
2            Random      98.68  BC
1            Random      98.93  C
Table 30: Bash, granularity * grouping, APFD, Optimal and Additional.
Bash
Source: Granularity * Grouping * Technique
Dependent Variable: APFD

Granularity  Grouping    Technique   Mean   Homogeneous Groups
64           Functional  Additional  59.88  A
32           Functional  Additional  68.42  A
64           Random      Additional  70.22  A
16           Random      Additional  84.86  B
8            Random      Additional  86.43  BC
16           Functional  Additional  90.30  BCD
32           Random      Additional  93.02  BCD
4            Random      Additional  94.34  BCD
8            Functional  Additional  94.58  BCD
4            Functional  Additional  95.19  BCD
64           Functional  Optimal     95.88  CD
64           Random      Optimal     96.30  CD
1            Functional  Additional  97.23  CD
2            Functional  Additional  97.28  CD
32           Random      Optimal     97.53  D
2            Random      Additional  97.54  D
32           Functional  Optimal     97.77  D
1            Random      Additional  97.96  D
16           Functional  Optimal     98.58  D
16           Random      Optimal     98.67  D
8            Functional  Optimal     99.27  D
8            Random      Optimal     99.32  D
4            Functional  Optimal     99.63  D
4            Random      Optimal     99.65  D
2            Functional  Optimal     99.81  D
2            Random      Optimal     99.82  D
1            Functional  Optimal     99.90  D
1            Random      Optimal     99.90  D
Table 31: Bash, granularity * grouping * technique, APFD, Optimal and Additional.
Bash
Source: Granularity * Technique
Dependent Variable: APFD

Granularity  Technique      Mean   Homogeneous Groups
64           Diff-Coverage  66.97  A
32           Diff-Coverage  75.66  AB
4            Diff-Coverage  80.51  ABC
16           Diff-Coverage  80.65  ABC
8            Diff-Coverage  83.17  BCD
1            Diff-Coverage  86.27  BCDE
2            Diff-Coverage  92.44  CDE
64           Optimal        96.09  DE
32           Optimal        97.65  E
16           Optimal        98.63  E
8            Optimal        99.30  E
4            Optimal        99.64  E
2            Optimal        99.82  E
1            Optimal        99.90  E
Table 32: Bash, granularity * technique, APFD, Optimal and Diff-Coverage.
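The Coverage, Additional, and Diff-Coverage techniques compared in these tables order tests by coverage information. The following is a hypothetical sketch of the greedy "additional coverage" idea only (the function name, the dict-based coverage input, and the reset-on-full-coverage behavior are our simplifications, not the article's implementation):

```python
def additional_coverage_order(coverage):
    """Greedy 'additional coverage' prioritization: repeatedly schedule
    the test that covers the most not-yet-covered elements; once all
    elements are covered, reset and reprioritize the remaining tests.

    coverage: dict mapping test id -> set of covered code elements.
    Returns the list of test ids in prioritized order.
    """
    remaining = dict(coverage)
    covered = set()
    order = []
    while remaining:
        # Pick the test with the largest marginal coverage gain.
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:
            if covered:
                covered = set()  # full coverage reached: start a new round
                continue
            order.extend(remaining)  # leftover tests cover nothing at all
            break
        order.append(best)
        covered |= remaining.pop(best)
    return order
```

For example, given three tests covering {1, 2}, {2, 3, 4}, and {5}, the sketch schedules the three-element test first, illustrating why finer-grained suites give such techniques more scheduling freedom, consistent with the granularity effects in the APFD tables above.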