On Test Suite Composition and Cost-Eﬁective...

On Test Suite Composition and Cost-Effective Regression Testing.∗

Gregg Rothermel†, Sebastian Elbaum‡, Alexey Malishevsky†, Praveen Kallakuri‡, Xuemei Qiu†

†School of Electrical Engineeringand Computer Science

Oregon State UniversityCorvallis, Oregon

{grother, malishal, qiuxu}@cs.orst.edu

‡Department of Computer Scienceand Engineering

University of Nebraska - LincolnLincoln, Nebraska

{elbaum, pkallaku}@cse.unl.edu

August 30, 2003

Abstract

Regression testing is an expensive testing process used to re-validate software as it evolves. Variousmethodologies for improving regression testing processes have been explored, but the cost-effectivenessof these methodologies has been shown to vary with characteristics of regression test suites. One suchcharacteristic involves the way in which test inputs are composed into test cases within a test suite.This article reports the results of controlled experiments examining the affects of two factors in testsuite composition — test suite granularity and test input grouping — on the costs and benefits ofseveral regression-testing-related methodologies: retest-all, regression test selection, test suite reduction,and test case prioritization. These experiments consider the application of several specific techniques,from each of these methodologies, across ten releases each of two substantial software systems, usingseven levels of test suite granularity and two types of test input grouping. The effects of granularity,technique, and grouping on the cost and fault-detection effectiveness of regression testing under the givenmethodologies are analyzed. This analysis shows that test suite granularity significantly affects severalcost-benefits factors for the methodologies considered, while test input grouping has limited effects.Further, the results expose essential tradeoffs affecting the relationship between test suite design andregression testing cost-effectiveness, with several implications for practice.

1 Introduction

As software evolves, test engineers regression test it to validate new features and detect whether new faults

have been introduced into previously tested code. Regression testing is important, but also expensive, so

many methodologies for improving its cost-effectiveness have been investigated. Among these methodologies

are four that involve reuse of existing test cases. The retest-all methodology [26, 30] re-uses all previously

developed test cases, executing them on the modified program. Regression test selection (e.g., [7, 36]) re-

uses test cases too, but selectively, focusing on subsets of existing test suites. Test case prioritization (e.g.,

[11, 39, 40, 44]) orders test cases so that those that are better at achieving testing objectives are run earlier

in the regression testing cycle. Finally, test suite reduction (e.g., [6, 17, 29]) attempts to reduce future

regression testing costs by permanently eliminating test cases from test suites.∗Portions of this research have been previously presented in [33].

1

The cost-effectiveness of specific techniques under these four methodologies varies with characteristics

of test suites [11, 37, 38]. One prominent factor in this variance involves the way in which test inputs are

composed into test cases within a test suite. For example:

• A test suite for a word processor might contain just a few test cases that start up the system, open a

document, issue hundreds of editing commands, and close the document, or it might contain hundreds

of test cases that each issue only a few commands.

• A test suite for a compiler might contain several test cases that each compile a source file containing

hundreds of language constructs, or hundreds of test cases that each compile source files containing

just a few constructs.

• A test suite for a class library might contain a few test drivers that each invoke dozens of methods, or

dozens of drivers that each invoke just a few methods.

These examples expose important choices in test suite design, and faced with such choices, test engineers

may wonder how best to proceed. Textbooks and articles on testing provide varying and sometimes contra-

dictory advice. Beizer [2, p. 51], for example, asserts: “It’s better to use several simple, obvious tests than

to do the job with fewer, but grander, tests.” Kaner et al. [23, p. 125] suggest that large test cases can save

time, provided they are not overly complicated, in which case simpler test cases may be more efficient. Kit

[25, p. 107] suggests that when testing valid inputs for which failures should be infrequent, large test cases

are preferable. Hildebrandt [20] argues that small test cases facilitate debugging. Bach [1] states that small

test cases cause fewer difficulties with cascading errors, but large test cases are better at exposing system

level failures involving interactions between software components.

Most of the foregoing statements refer to test case size, but the issues concerned are more complex. In

this article, we consider two specific characteristics of test suite composition: test suite granularity and test

input grouping. These characteristics pertain to the way in which test engineers group individual test inputs

into test cases within test suites. Test suite granularity pertains to the size of the test cases so grouped –

the number of inputs, or amount of input applied, per test case. Test input grouping pertains to the content

of test cases – the degree of hetero- or homogeneity among the inputs that compose a test case. (We define

these characteristics more precisely in Section 2, and provide precise measures for them in Section 3.2.1).

Despite the apparent importance of test suite composition and the apparent contradictions among state-

ments in the popular testing literature, in our search of the research literature we find little formal examina-

tion of the cost-benefits tradeoffs associated with test suite granularity and test input grouping. A thorough

investigation of these tradeoffs and the implications they hold for testing across the software lifecycle could

help test engineers design test suites that better support cost-effective regression testing.

We have therefore designed and performed a family of controlled experiments, examining the effects

of test suite granularity and test input grouping on the costs and benefits of the four regression-testing-

related methodologies mentioned above: retest-all, regression test selection, test suite reduction, and test

case prioritization. Our experiments consider the application of several techniques, under each of these

methodologies, across ten releases each of two substantial software systems, using seven different levels of

2

test suite granularity and two different types of test input grouping. We measure and analyze the effects of

granularity, technique, and grouping on the costs of regression testing the systems as they evolve, and on

the fault-detection effectiveness of that regression testing.

Our results show that test suite granularity significantly affects several cost-benefits factors for the

methodologies considered, while test input grouping has limited effects. Further, our results expose several

essential tradeoffs affecting the relationship between test suite design and regression testing cost-effectiveness,

with several implications for practice.

In the following section we review the issues and the previous literature related to this work. Section 3

presents our experiment design, results, and analysis. Section 4 discusses the implications of our results, and

Section 5 summarizes and comments on future work.

2 Background and Related Work

One could certainly study the effects of test suite composition on the cost-effectiveness of test suites, focusing

on the testing of initial versions of new software systems. Such a study could provide data on the cost-

effectiveness of various types of test development strategies relative to initial system releases, a context that

is certainly important.

On our view, however, such a study would overlook a central facet of software system development.

Successful software systems are seldom developed and tested just once; rather, they evolve, and are re-tested

repeatedly across their lifetimes. A testing methodology that is effective for an initial system release, but that

complicates subsequent regression testing of the system as it evolves, may be less cost-effective overall than

a methodology that is initially expensive but amortizes initial testing costs over subsequent, cost-effective,

regression testing runs.

A fundamental thesis behind this work, therefore, is that testing cost and effectiveness are best assessed

relative to systems across their lifecycles. This means, among other things, that we must assess testing

techniques and test design choices relative to their effects on regression testing.

For this reason, in this work, we study the effects of test suite granularity and test input grouping on

testing activities in relation to regression testing.

In the following subsections, we provide more detailed discussion of test suite granularity and test input

grouping, we describe the particular regression testing activities on which we focus, and we discuss related

work on these topics.

2.1 Test Suite Granularity and Test Input Grouping

Following Binder [4], we define a test case to consist of a pretest state of the system under test (including

its environment), a sequence of test inputs, and a statement of expected test results. We define a test suite

to be a set of test cases.

Definitions of test suite granularity and test input grouping are harder to come by, but the testing problem

we are addressing is a practical one, so we begin by drawing on examples.

3

Test engineers designing test cases for a system identify various testing requirements for that system, such

as specification items, code elements, or method sequences. Next, they must construct test cases that exercise

these requirements. An engineer testing a word processor might specify sequences of editing commands, an

engineer testing a compiler might create sample target-language programs, and an engineer testing a class

library might develop drivers that invoke methods. The practical questions these engineers face include how

many and which editing commands to include per sequence, how many and which constructs to include in

each target-language program, and how many and which methods to invoke per driver, respectively.

We wish to answer these questions, and the answers are likely to involve many factors. For example, if

the cost of performing setup activities for individual test cases dominates the cost of executing those test

cases, a test suite containing a few large test cases can be less expensive than a suite containing many small

test cases. Large test cases might also be better than small ones at exposing failures caused by interactions

among system functions. Small test cases, on the other hand, can be easier to use in debugging than large test

cases, because they reduce occurrences of cascading errors [1] and simplify fault localization [20]. Further, in

test cases composed of large numbers of test inputs, inputs occuring early in the test cases may prevent test

inputs that appear later in those test cases from exercising the requirements they are intended to exercise,

by causing subsequent test inputs to be applied from system states that differ from those intended.

In part, the foregoing examples involve test case size, a term used informally in [1, 2, 23, 25] to denote

notions such as the number of commands applied to, or the amount of input processed by, the program under

test, for a given test case. However, there is more than just test case size involved: when engineers increase

or decrease the number of requirements covered by each test case, this directly determines the number of

individual test cases that must be created to cover all the requirements. Thus, as expressed by Beizer [2],

the choice is not just between “large” and “small” tests, but between “several simple, obvious tests” and

“fewer, but grander, tests”.

The interaction of test case size and number of test cases is one plausible factor underlying the cost-

benefits tradeoffs described above. One phenomenon we wish to study in this article, then, involves the

effects that occur when test inputs are composed into specific size test cases in a test suite. We use the term

test suite granularity to describe a partition on a set of test inputs into a test suite containing test cases of

a given size. Section 3.2.1 presents a precise metric for this construct.

An additional factor that may influence the effects of choices in test suite design, however, involves the

relationship between the particular test inputs that are assembled into individual test cases. For example, a

typical approach in test development and automation is for test engineers to group together, into individual

test cases, test inputs that address similar functionality (for example, inputs related to a specific use case or

set of related functional requirements). This can be distinguished from approaches that group test inputs in

other ways, such as by engineer or team. We use the term test input grouping to describe this factor. Section

3.2.1 provides a precise metric for this construct.

As thus defined, test suite granularity concerns the sizes of individual test cases, but not their content,

and test input grouping concerns the content of individual test cases, but not their size. Together these

two terms represent test suite composition, but as we shall show, the two factors can be varied separately,

4

allowing us to examine both their individual and combined roles in affecting the cost and effectiveness of

regression testing methodologies.

Other definitions of test case, test case size, test suite granularity, and test input grouping than those

used in this work could also be of practical interest. Test engineers might choose to view the individual

inputs applied during a single invocation of a word processor, or the individual method invocations made

from within a class driver, as individual test cases, each with its own size. Also, in practice, test suites may

contain test cases of varying sizes and with varying logic underlying groupings. As we show in Section 3,

however, our definitions facilitate the controlled study of the cost-benefits tradeoffs outlined above, allowing

us to investigate questions of causality not otherwise amenable to study.

2.2 Regression Testing and Regression-Testing-Related Methodologies

Let P be a program, let P ′ be a modified version of P , and let T be a test suite developed for P . Regression

testing is concerned with validating P ′.

To facilitate regression testing, engineers typically re-use T , but new test cases may also be required to

test new functionality. Both reuse of T and creation of new test cases are important; however, it is test case

reuse that concerns us here, as it is the desire to re-use test cases that motivates most suggestions about

costs and benefits of test suite granularity. In particular, we consider four methodologies related to regression

testing and test reuse: retest-all, regression test selection, test suite reduction, and test case prioritization.1

2.2.1 Retest-all

When P is modified, creating P ′, test engineers may simply reuse all non-obsolete test cases in T to test P ′;

this is known as the retest-all technique [26]. (Test cases in T that no longer apply to P ′ are obsolete, and

must be reformulated or discarded [26].) The retest-all technique represents typical current practice [30],

and thus, serves as our control technique.

2.2.2 Regression Test Selection

The retest all technique can be expensive: rerunning all test cases may require an unacceptable amount of

time or human effort. Regression test selection (RTS) techniques (e.g., [5, 7, 14, 27, 36, 41]) use information

about P , P ′, and T to select a subset of T with which to test P ′. (For a survey of RTS techniques, see [35].)

Empirical studies of some of these techniques [7, 15, 34, 37] have shown that they can be cost-effective.

One cost-benefits tradeoff among RTS techniques involves safety and efficiency. Safe RTS techniques

(e.g. [7, 36, 41]) guarantee that, under certain conditions, test cases not selected could not have exposed

faults in P ′ [35]. Achieving safety, however, may require inclusion of a larger number of test cases than can

be run in available testing time. Non-safe RTS techniques (e.g. [14, 16, 27]) sacrifice safety for efficiency,

selecting test cases that, in some sense, are more useful than those excluded. A special case among non-safe

techniques involves techniques that attempt to minimize the selected test suite relative to a fixed set of1There are also several other sub-problems related to the regression testing effort, including the problems of automating

testing activities, managing testing-related artifacts, identifying obsolete tests, and providing test oracle support [19, 26, 30].We do not directly address these problems here, although our results could have implications worth considering for them.

5

coverage requirements and information on changes (e.g. [14]), seeking the lowest test execution cost possible

consistent with covering changed sections of code.

2.2.3 Test Suite Reduction

As P evolves, new test cases may be added to T to validate new functionality. Over time, T grows, and its

test cases may become redundant in terms of code or functionality exercised. Test suite reduction techniques2

[6, 17, 22, 29] address this problem by using information about P and T to permanently remove redundant

test cases from T , rendering later reuse of T more efficient. Test suite reduction thus differs from regression

test selection in that the latter does not permanently remove test cases from T , but simply “screens” those

test cases for use on a specific version P ′ of P , retaining unused test cases for use on future releases. Test

suite reduction analyses are also typically accomplished (unlike regression test selection) independent of P ′.

By reducing test-suite size, test-suite reduction techniques reduce the costs of executing, validating, and

managing test suites over future releases of the software. A potential drawback of test-suite reduction, how-

ever, is that removal of test cases from a test suite may damage that test suite’s fault-detecting capabilities.

Some studies [43] have shown that test-suite reduction can produce substantial savings at little cost to fault-

detection effectiveness. Other studies [38] have shown that test suite reduction can significantly reduce the

fault-detection effectiveness of test suites.

2.2.4 Test Case Prioritization

Test case prioritization techniques [11, 22, 39, 40, 44], schedule test cases so that those with the highest

priority, according to some criterion, are executed earlier in the regression testing process than lower priority

test cases. For example, testers might wish to schedule test cases in an order that achieves code coverage at

the fastest rate possible, exercises features in order of expected frequency of use, or increases the likelihood

of detecting faults early in testing.

Empirical results [11, 39, 44] suggest that several simple prioritization techniques can significantly im-

prove one testing performance goal; namely, the rate at which test suites detect faults. An improved rate of

fault detection during regression testing provides earlier feedback on the system under test and lets software

engineers begin addressing faults earlier than might otherwise be possible. These results also suggest, how-

ever, that the relative cost-effectiveness of prioritization techniques varies across workloads (programs, test

suites, and types of modifications).

Many different prioritization techniques have been proposed [10, 11, 39, 40, 44], but the techniques most

prevalent in literature and practice involve those that utilize simple code coverage information, and those

that supplement coverage information with details on where code has been modified. The latter approach

has been found efficient on extremely large systems at Microsoft [40], but the relative effectiveness of the

approaches has been shown to vary with several factors including characteristics of the test suite utilized

[13], further motivating experiments such as those reported in this article.2Test suite reduction has also been referred to, in the literature, as test suite minimization; however, the intractability of

the test suite minimization problem forces techniques to employ heuristics that may not yield minimum test suites; thus, weterm these techniques “reduction” techniques.

6

2.3 Related Work

Many articles [7, 9, 15, 24, 38, 43] have examined the costs and benefits of retest-all, regression test selection,

test case prioritization, and test case reduction techniques. Several textbooks and articles on testing [1, 2,

9, 20, 23, 25, 38] have discussed tradeoffs involving test suite granularity. None of this literature, however,

describes any formal or empirical examinations of these tradeoffs.

In [34, 37], test suite granularity is specifically treated as a factor in two studies of regression test

selection, and test suites constructed from smaller test cases are shown to facilitate selection. These studies,

however, measured only numbers of test cases selected, considered only safe RTS techniques, and omitted

consideration of test input grouping. In contrast, this article presents the results of controlled experiments

designed specifically to examine the impact of test suite granularity and test input grouping on the costs

and savings associated with several regression testing methodologies and techniques, across several metrics

of importance.

In [33], we presented the results of an initial set of controlled experiments examining the effects of test

suite granularity on the retest-all, regression test selection, test suite reduction, and test case prioritization

methodologies. The experiments reported in this article extend those experiments in the following ways:

• The experiments in [33] treated test suite granularity, program, and technique as independent variables;

these experiments expand the set of independent variables considered to include test input grouping.

• The experiments in [33] utilized six versions each of two subject software systems; these experiments

expand the subject pool to ten versions of each of these systems.

• The experiments in [33] utilized four levels of test suite granularity; these experiments expand this to

seven levels.

• These experiments examine an additional regression test selection technique and an additional test

case prioritization technique, each representing important classes of techniques not considered in [33].

• These experiments utilize improved test oracles, providing a new view on fault detection results.

• The analysis of the results of the experiments in [33] considered only main effects; the analysis of the

results of these experiments also considers significant interactions.

• The discussion of the results obtained in these experiments utilizes an additional measure of fault-

detection effectiveness not considered in [33].

• The discussion of results considers not only general tendencies, but also the particular findings and

impact of those findings within each methodology.

The net effect of these changes is an expansion of the external, construct, and conclusion validity of the

results reported in [33], and a more thorough understanding of the effects of test suite composition than was

achievable through the earlier experiments alone.

7

3 Experiments

Informally, our goal is to address the research question: “how do test suite granularity and test input grouping

affect the costs and benefits of regression testing methodologies?” More formally, we seek to evaluate the

following hypotheses (expressed as null hypotheses) for four methodologies — retest all, regression test

selection, test suite reduction, and test case prioritization — at a 0.05 level of significance:

H1 (test suite granularity): Test suite granularity does not have a significant impact on the

costs and benefits of regression testing techniques.

H2 (test input grouping): Test input grouping does not have a significant impact on the costs

and benefits of regression testing techniques.

H3 (technique): Regression testing techniques do not perform significantly differently in terms

of the selected costs and benefits measures.3

H4 (interactions): Test suite granularity and test input grouping effects across regression test-

ing techniques and programs do not significantly differ.

To test these hypotheses we designed several controlled experiments. The following subsections present,

for these experiments, our objects of analysis, independent variables, dependent variables and measures,

experiment setup and design, threats to validity, and data and analysis. Further discussion of the results

and their implications follows in Section 4.

3.1 Objects of Analysis: emp-server and bash

As objects of analysis we utilized ten releases each of two substantial C programs: emp-server and bash.

Emp-server is the server component of the open-source client-server internet game Empire. Emp-server

is essentially a transaction manager: its main routine consists of initialization code followed by an event

loop in which execution waits for receipt of a user command. Emp-server is invoked and left running on a

host system; a user communicates with the server by executing a client that transmits the user’s inputs

to it as commands. When emp-server receives a command, its event loop invokes routines that process the

command, then waits to receive the next command. As emp-server processes commands, it may return

data to the client program for display on the user’s terminal, or write data to a local database (a directory of

ASCII and binary files) that keeps track of the game’s state. The event loop and program terminate when a

user issues a “quit” command. Table 1 shows the numbers of functions and lines of executable code in each

of the ten versions of emp-server that we considered, and for each version after the first, the number of

functions changed for that version (modified or added to the version, or deleted from the preceding version).

Bash [32], short for “Bourne Again SHell”, is a popular open-source application that provides a command

line interface to multiple Unix services. Bash was developed as part of the GNU Project, adopting several

features from the Korn and C shells, but also incorporating new functionality such as improved command line

editing, unlimited size command history, job control, indexed arrays of unlimited size, and more advanced3This hypothesis has been tested in previous studies, and is included primarily for completeness and replication.

8

Changed LinesProgram Version Functions Functions of Code

emp-server 4.2.0 1,188 — 63,014emp-server 4.2.1 1,188 51 63,014emp-server 4.2.2 1,197 245 63,658emp-server 4.2.3 1,196 157 63,937emp-server 4.2.4 1,197 9 63,988emp-server 4.2.5 1,197 101 64,063emp-server 4.2.6 1,197 32 64,108emp-server 4.2.7 1,197 156 64,439emp-server 4.2.8 1,189 52 64,381emp-server 4.2.9 1,189 12 64,396bash 2.0 1,494 — 48,292bash 2.01 1,537 238 49,555bash 2.01.1 1,538 40 49,666bash 2.02 1,678 197 58,090bash 2.02.1 1,678 12 58,103bash 2.03 1,703 152 59,010bash 2.04 1,890 267 63,648bash 2.05a 1,942 411 65,319bash 2.05b 1,949 34 65,433bash 2.05 1,950 20 65,474

Table 1: Experiment Subjects

integer arithmetic. Bash is still evolving; on average two new releases have emerged per year over the last

five years. The ten versions of bash that we used were released from 1996 to 2001 (see Table 1). Each release

corrects faults, but also provides new functionality as evident by the increasing code size.

3.2 Variables and Measures

3.2.1 Independent Variables

Our experiments manipulated three independent variables: regression testing technique, test suite granular-

ity, and test input grouping.

Regression Testing Technique

For each regression testing methodology considered other than retest-all, we studied several techniques. In

selecting techniques we had three goals: (1) to include techniques that could serve as practical experimental

controls, (2) to include techniques that could easily be implemented by practitioners, and (3) to include

techniques that exemplify the primary categories of available techniques (and in so doing, reflect the primary

potential tradeoffs among techniques).

Retest-all. There is just one retest-all technique: run all of the non-obsolete test cases in T on P ′. We

investigate the effects of test suite granularity and test input grouping on this technique. (The retest-

all technique also serves as a control technique in our evaluations of RTS and test suite reduction

methodologies, as it represents standard practice when those methodologies are not employed.)

Regression test selection. We selected four RTS techniques, retest-all, modified entity, modified non-core

entity, and minimization:

9

• In this context retest-all is our control technique, representing the typical current practice of

selecting all non-obsolete test cases for re-execution.

• The modified entity technique [7] is a safe RTS technique: it selects test cases that exercise

functions, in P , that (1) have been deleted or changed in producing P ′, or (2) use variables or

structures that have been deleted or changed in producing P ′.

• The modified non-core entity technique [33] acts like the modified-entity technique, but ignores

“core” functions, defined as functions exercised by more than k% of the test cases in the test

suite. Following results of previous studies of technique effectiveness [3, 34], we set k to 80%.

This technique trades safety for savings in re-testing effort (selecting all test cases through core

functions may lead to selecting all of T ).

• The minimization technique [14] attempts to select a minimal set of test cases, from T , that

yields coverage of modified functions in P ′. This is necessarily an heuristic, as the technique uses

coverage information gathered from applying T to P to attempt to predict the functions that will

be covered in P ′.

Test suite reduction. We selected two test suite reduction techniques, no reduction and GHS reduction.

• The no reduction technique, equivalent to retest-all, represents current typical practice and serves

as our control.

• The GHS reduction technique is an heuristic presented by Gupta, Harrold, and Soffa [17] that

attempts to produce suites that are minimal for a given coverage criterion; we used a function

coverage criterion.

Test case prioritization. We selected three test case prioritization techniques: additional function cov-

erage, additional modified-function coverage, and optimal prioritization. These are described in detail

in [39], we summarize them here.

• Additional function coverage prioritization iteratively selects a test case that yields the greatest

function coverage, then adjusts the coverage information on subsequent test cases to indicate their

coverage of functions not yet covered, and then repeats this process until all functions covered by

at least one test case have been covered. The process then iterates on the remaining test cases.

• Additional modified-function coverage prioritization acts like additional function coverage prior-

itization, except that it initially attends only to functions that have been modified; after all test

cases executing one or more modified functions have been placed in the order, additional function

coverage prioritization is applied to the remaining test cases.

• Optimal prioritization uses information on which test cases in T reveal faults in P ′ to find an

approximate optimal ordering for T . Though not a practical technique (in practice we do not

know which test cases reveal which faults beforehand), this technique provides an upper bound

on prioritization benefits.

10

Test Suite Granularity

To investigate the impact of test suite granularity on the costs and benefits of regression testing techniques,

we needed to obtain test suites of varying granularities, while controlling for other factors that might affect

our dependent measures.

We considered two approaches for doing this. The first approach is to obtain or construct test suites

for a program, partition them into subsets according to size, and compare the results of executing these

different subsets. A drawback of this approach, however, is that it will not let us determine whether a causal

relationship exists between test suite granularity and measures of costs or benefits, because it does not

control for other factors that might influence those measures. To see this, suppose that T can be partitioned

into two subsets, T1 and T2, where T1 contains test cases of size less than s and T2 contains test cases of

size greater than or equal to s. Suppose that we compare the costs or benefits of utilizing T1 and T2 and

find that they differ. In this case, we cannot determine whether this difference was caused by test suite

granularity or by differences in the number or type of inputs applied in T1 and T2. For example, the types

of functionality exercised by the inputs in T2 might happen to include all functionality modified to create

P ′, causing differences in performance between the two subsets to occur for reasons other than test case

granularity.

The second approach that we considered is to construct test suites of varying granularities by sampling

a single pool or “universe” of test grains. A test grain is a smallest input that could be used as a test case

(applied from a start state and producing a checkable output) for a target program. A sampling procedure can

select test grains to create test cases of different sizes: a test case of size s consists of s test grains. Applying

this sampling procedure randomly and repeatedly to a universe of n test grains, without replacement, until

no test grains remain (partitioning the universe into n/s test cases of size s, and possibly one smaller test

case), yields a test suite of granularity level s. Repeating this procedure for each of several values of s

provides test suites of different granularity levels that can be compared controlling for differences in types

and numbers of inputs.

We chose this second approach, and employed seven granularity levels: 1, 2, 4, 8, 16, 32 and 64, which

we refer to as G1, G2, G4, G8, G16, G32 and G64, respectively. To facilitate discussion, when referring to

granularity levels, we refer to test suites employing lower granularity level numbers as fine granularity test

suites, and test suites employing higher granularity level numbers as coarse granularity test suites.

Test Input Grouping

In our procedure for constructing test suites of different granularities, applying the sampling procedure

repeatedly to a universe of n test grains and sampling randomly across the whole universe each time (without

replacement) creates random grouping test cases. Such a grouping strategy, however, may not reflect the

way in which inputs are grouped into test cases in practice, and thus we also considered a second strategy

for grouping test inputs, which creates functional grouping test cases.

Functional grouping test cases are composed (to the extent possible) of inputs that exercise the same

functionality. To create functional grouping test cases, we first separated the test grains in the test universe U

11

for each program P into “buckets”, where each bucket Bk contains the test grains in U targetting functionality

k in P . Given these buckets, we considered two approaches for creating functional grouping test cases of

granularity level s:

• From within each bucket, randomly select groups of s test grains without replacement until fewer than

s test grains remain in the bucket. Do this for each bucket. Collect any test grains remaining in any

buckets into a single pool, and from them, randomly select groups of size s test cases from this pool

until all have been selected.

• From within each bucket, randomly select groups of s test grains without replacement until fewer than

s test grains remain in the bucket. If any test grains remain in that bucket, let them constitute one

final group (of size less than s). Do this for each bucket.

The difference between these two approaches lies in their handling of test cases that remain in buckets

after the maximum possible number of groups of size s have been selected from those buckets. The first

approach has the drawback that, depending on the number and sizes of buckets, it may create a certain

number of functionally non-homogeneous test cases. The second approach has the drawback that it might

yield a large number of test cases of size less than s at each granularity level (potentially as many as one per

bucket). The presence of test cases of different sizes would make it impossible to draw conclusions about the

effects of granularity: we need to control for the number and size of test cases created at each granularity

level.

Thus, we selected the first approach as our grouping strategy. This strategy provides us with a set of test

cases, at each granularity level, equivalent in size to the set of test cases obtained with the random grouping

strategy, and lets us draw conclusions about the potential influence of functional grouping on granularity

effects. In interpreting our results we take care to consider functional non-homogeneity among our test cases.

3.2.2 Dependent Variables and Measures

To investigate our hypotheses we need to measure the costs and benefits of the various regression-testing-

techniques considered. To do this we constructed three models. Our first two models assess the costs and

benefits of retest-all, regression test selection and test case reduction, and our third model assesses the

benefits of test case prioritization.

Savings in Test Execution Time

Regression test selection and test suite reduction techniques achieve savings by reducing the number of test

cases that need to be executed on P ′, thereby reducing the effort required to retest P ′. The use of different

test suite granularities and test input groupings may also affect the savings in test execution and validation

time that can be achieved by selection, reduction, and retest-all. To evaluate these effects, we measure the

time required to execute and validate the outputs of the test cases in test suites, selected test suites, and

reduced test suites, across different granularities and groupings.

12

Costs in Fault-Detection Effectiveness

One potential cost of regression test selection and test suite reduction is the cost of missing faults that

would have been exposed by test suites prior to selection or reduction. Missed faults could also occur due to

differences in test suite granularity or test input grouping, for these techniques and the retest-all technique.

Costs in fault-detection effectiveness can be measured by studying programs containing known faults.

When dealing with single faults, one common fault-detection effectiveness measure [15, 21] estimates, for

each test case t, whether t detects fault f in P ′, by applying t to two versions of P ′, one that contains f

and one that does not. If the outputs of P and P ′ (program outputs and contents of relevant external files)

differ on t, t is assumed to reveal f . Given this approach, the fault-detection effectiveness for a specific test

suite T can be measured by considering fault-detection effectiveness results for each test case t ∈ T .

In our experiments, however, we wish to study programs containing multiple faults. When P ′ contains

multiple faults it is not sufficient to note which test cases cause P and P ′ to produce different outputs, we

must also determine which test cases could contribute to revealing which faults. One way to do this [24] is

to instrument P ′ such that when t is run on P ′ we can determine, for each fault f in P ′, whether: (1) t

reaches f , (2) t causes a change in data state following execution of f , and (3) the output of P ′ on t differs

from the output of P on t.

One drawback of this approach is that it can underestimate the faults that could be found in practice

with t. To see this, suppose that P ′ contains faults f1 and f2, which can each be detected by t if present

alone. Suppose, however, that when f1 and f2 are both present in P ′, f1 prevents t from reaching f2. This

approach would suggest that t cannot detect f2. In a debugging process, however, an engineer might detect

and correct f1, and then on re-running t on the (partially) corrected P ′, detect f2. A second drawback of

this approach is that testing for data state changes can be extremely difficult in programs that manipulate

enormous data spaces, such as those we use in these experiments.

For these reasons, we chose a different approach. We activated each fault f in P ′ individually, executed

each test case t (at each granularity level) on P ′, and determined whether t detects f singly by noting

whether it causes P and P ′ to produce different outputs. We then assumed that detection of f when present

singly implies detection of f when present in combination with other faults.

This approach avoids the drawbacks of the first: it captures the results of an incremental fault-correction

process without requiring detection of data state changes. The approach may overestimate fault detection,

however, in cases where multiple faults would actually mask each other’s effects, causing no failures to occur

on t. We investigated the possible magnitude of this error in our study by also executing, at each granularity

level, all test cases on versions with all seeded faults activated, and measuring the extent to which test cases

that caused single-fault versions to fail did not cause multi-fault versions to fail.4 The data showed that for

emp-server, across all versions and granularities, masking occurred on only 339 of 70,992 test cases (0.48%),

and for bash, across all versions and granularities, it occurred on only 5 of 41,742 test cases (0.012%). We

thus considered masking a nuisance variable, posing only a minor threat to the validity of our experiments.4This check does not eliminate the possibility that some subset of the faults in a multi-fault version might mask one another,

and be undetected by test case t in that version even though detected singly by t; however, it is not computationally feasibleto check for this possibility.

13

B. APFD for prioritized suite T1 C. APFD for prioritized suite T2 D. APFD for prioritized suite T3A. Test suite and faults exposed

0.2 0.4 0.6 0.8 1.0

0

0

10

20

30

40

50

60

70

80

90

100

Test Suite Fraction

Test Case Order: C-E-B-A-D

Per

cent

Det

ecte

d F

aults

0.2 0.4 0.6 0.8 1.00

10

20

30

40

50

60

70

80

90

Test Case Order: E-D-C-B-A

100

0

Test Suite Fraction

Per

cent

Det

ecte

d F

aults

APFD = 84%APFD = 64%

1 2 3 4 5 6 7 8 9 10x x x xx x x x x x x x x x x

ABCDE

test fault

0.6 0.8

10

20

30

40

50

60

70

80

90

Test Suite Fraction

100

0

0 0.2 0.4 1.0

Test Case Order: A-B-C-D-E

Per

cent

Det

ecte

d F

aults

APFD = 50%

Figure 1: Examples illustrating the APFD metric.

Savings in Rate of Fault Detection

The test case prioritization techniques we consider have a goal of increasing a test suite’s rate of fault

detection. We wish to determine whether test suite granularity and test input grouping affect the ability of

prioritization technique’s to achieve this goal. To measure rate of fault detection, we use a metric APFD,

introduced for this purpose in [39], that measures the weighted average of the percentage of faults detected

over the life of a test suite. APFD values range from 0 to 100; higher numbers imply faster (better) fault

detection rates. More formally, let T be a test suite containing n test cases, and let F be a set of m faults

revealed by T . Let TFi be the index of the first test case in ordering T ′ of T that reveals fault i. The APFD

for test suite T ′ is given by the equation:

APFD = 1 − TF1 + TF2 + ... + TFm

nm+

12n

To obtain an intuition for this metric, consider an example program with 10 faults and a test suite of

5 test cases, A through E, with fault detecting abilities as shown in Figure 1.A. Suppose we place the test

cases in order A–B–C–D–E to form prioritized test suite T1. Figure 1.B shows the percentage of detected

faults versus the fraction of T1 used. After running test case A, 2 of the 10 faults are detected; thus 20%

of the faults have been detected after 0.2 of T1 has been used. After running test case B, 2 more faults are

detected and thus 40% of the faults have been detected after 0.4 of the T1 has been used. The area under

the curve represents the weighted average of the percentage of faults detected over the life of the test suite.

This area is the prioritized test suite’s average percentage faults detected metric (APFD); the APFD is 50%

in this example.

Figure 1.C reflects what happens when the order of test cases is changed to E–D–C–B–A, yielding a

“faster detecting” suite than T1 with APFD 64%. Figure 1.D shows the effects of using a prioritized test

suite T3 whose test case order is C–E–B–A–D. By inspection, it is clear that this order results in the

earliest detection of the most faults and illustrates an optimal order, with APFD 84%.

14

emp-server bash

G1 1985 1168G2 993 584G4 497 292G8 249 146G16 125 73G32 63 37G64 32 19

Table 2: Test Cases per Granularity Level

3.3 Experiment Setup

3.3.1 Test Cases and Test Automation

To examine our research question we required test cases for emp-server and bash. These test cases needed

to be realistic, but also needed to facilitate the controlled investigation of the effects of test suite granularity

and test input grouping following the methodology outlined in Section 3.2.1. The approaches we used to

create and automate these test cases, which differed between our programs, were as follows.

Emp-server Test Cases and Test Automation

No test cases were available for emp-server. To construct test cases we used the Empire information files,

which describe the 196 commands recognized by emp-server and the parameters and environmental effects

associated with each. We treated these files as informal specifications for system functions and used them,

together with the category partition method [31], to construct a suite of test cases for emp-server that

exercise each parameter, environmental effect, and erroneous condition described in the files.

We deliberately created the smallest test cases possible, each using the minimum number of commands

necessary to cover its target requirement. Each test case consists of a sequence of between one and six lines

of characters (average 1.2 lines per test case), and constitutes a sequence of inputs to the client, which the

client passes to emp-server. Because the complexity of commands, parameters, and effects varies widely

across the various Empire commands, this process yielded between one and 38 test cases for each command,

and ultimately produced 1985 test cases. These test cases constituted our test grains, as well as our test

cases at granularity level G1. We then used the two sampling procedures described in Section 3.2.1 to create

random and functional grouping test suites at granularity levels G2, G4, G8, G16, G32, and G64, the sizes

of which are shown in Table 2.

The test cases for emp-server fell naturally into buckets distinguished by command, yielding 196 buckets

with an average size of 12 test cases apiece. No buckets had size greater than 64, and few had sizes greater than

16. Thus, test cases created by our sampling procedure for emp-server become less functionally homogeneous

as granularity level increases (see the discussion of this issue in Section 3.2.1.) Table 3 illustrates, for each

granularity level, the percentage of purely functionally homogenous test cases present in the functional

grouping test suites at that level. When analyzing our results we take care to consider this data.

To execute and validate test cases automatically, we created test scripts. Given test suite T , for each

test case t in T these scripts: (1) initialize the Empire database to a start state; (2) invoke emp-server;

15

Program G2 G4 G8 G16 G32 G64

emp-server 95.0 89.0 72.0 35.0 12.0 00.0bash 95.0 98.0 95.5 90.4 78.4 63.2

Table 3: Percentages of Purely Homogeneous Test Cases Present in Functional Groupings.

(3) invoke a client and issue the sequence of inputs that constitutes the test case to the client, saving all

output returned to the client; (4) terminate the client; (5) shut down emp-server; (6) save the contents of

the database for use in validation; and (7) compare saved client output and database contents with those

archived for the previous version, using a refined version of the Unix “diff” utility. By design, this process

lets us apply (in step 3) all of the test inputs contained in a test case, at all granularity levels.

Bash Test Cases and Test Automation

Each version of bash that we utilized had been released with a test suite, composed of test cases from previous

versions and new test cases designed to validate added functionality. We could not directly use these suites

for our experiment, because they were composed strictly of large test cases, each exercising whole functional

components. Further, the test suites executed, on average, only 33% of the functions in bash.

We thus created regression test suites for bash as follows. First, we partitioned each large test case that

came with bash release 2.0 into the smallest possible test grains. (We used the test cases from release 2.0

because they all function across all releases, whereas test cases added on subsequent releases do not function

on earlier ones, and a uniform application of test cases across all versions is needed to facilitate comparison.)

Second, to exercise functionality not covered by the original test suite, we created additional small test cases

by using the reference documentation for bash [32] as an informal specification.

The resulting test suite contains 1168 test cases, exercising an average of 64% of the functions across all

the versions. Each test case in the new test suite contains between one and 54 lines. Each line constitutes an

instruction consisting of bash or Expect [28] commands5 that can be executed on an instance of bash. The

1168 test cases constituted our test grains, and test cases at granularity level G1. As with emp-server, we

then followed the procedure described in Section 3.2.1 to create random and functional grouping test suites

at granularity levels G2, G4, G8, G16, G32, and G64, as reported in Table 2.

As with emp-server, our sampling procedure, applied to bash, did create some test cases that were not

homogeneous. For bash, however, the number of buckets identified (18) was far smaller, and average bucket

size (64) much larger, than for emp-server. Thus, functional grouping test cases were more frequently

functionally homogeneous for bash than for emp-server (see Table 3). When analyzing our results we take

care to consider this fact.

3.3.2 Faults

We wished to evaluate the performance of regression-testing-related methodologies with respect to detection

of regression faults – faults created in a program version as a result of the modifications that produced

that version. Emp-server and bash were not equipped, however, with fault logs of detail sufficient to let5Expect scripts were used for test cases exercising features of bash that required interaction.

16

us locate actual regression faults (a problem typical in the use of open-source software in experimentation).

Thus, following a procedure described in [21], we seeded faults. We asked several graduate and undergraduate

computer science students, each with at least two years experience programming in C and unacquainted with

the details of this study, to become familiar with the programs and insert regression faults into the versions.

The fault seeders were told to insert faults that were as realistic as possible based on their experience with

real programs, and that involved code deleted from, inserted into, or modified in the versions.

To further direct their efforts, the fault seeders were given the following list of types of faults to consider:

• Faults associated with variables, such as with definitions of variables, redefinitions of variables, deletions

of variables, or changes in values of variables in assignment statements.

• Faults associated with control flow, such as addition of new blocks of code, deletions of paths, redefi-

nitions of execution conditions, removal of blocks, changes in order of execution, new calls to external

functions, removal of calls to external functions, addition of functions, or deletions of functions.

• Faults associated with memory allocation, such as not freeing allocated memory, failing to initialize

memory, or creating erroneous pointers.

Given ten potential faults seeded in each version of each program, we activated these faults individually,

and executed the test suites (at each granularity level) for the programs to determine which faults could

be revealed by which test cases, following the process outlined in Section 3.2.2. We excluded any potential

faults that were not detected by any test cases at any granularity level: such faults are meaningless to our

measures and cannot influence our results. We also excluded any faults that, at every granularity level, were

detected by more than 80% of the test cases; our assumption was that such easily detected faults would be

detected by test engineers during their unit testing of modifications (only five faults fell into this category).

Excluding faults detected by greater than 80% of the test cases in some, as opposed to every, level would

be inappropriate: the exclusion rule must be uniform across levels to avoid biasing results in favor of faults

that are detected differently at different levels. When this process was complete, 159 faults remained across

all versions of both programs.

3.3.3 Additional Instrumentation

To perform our experiments we required additional instrumentation. Our test coverage and control-flow

graph information was provided by the Aristotle program analysis system [18] and by the Clic instrumentor

and monitor [12]. We created test case prioritization, test suite reduction, and regression test selection tools

implementing the techniques described in Section 3.2.1. We used Unix utilities and direct inspection to

determine modified functions, or functions using modified structures.

All timing-related data was gathered on a SunUltra 60 with 512 MB of memory. While timing data was

being collected, our testing processes were the only active user processes on the machines.

17

3.4 Experiment Design and Analysis Strategy

To address our hypotheses we designed four sets of experiments for each program, each with the same format.

These experiments evaluate the hypotheses for retest-all, regression test selection, test suite reduction, and

test case prioritization, respectively. In addition, each experiment has three factors with multiple levels to

ensure unbiased treatment assignment. We employ a Randomized Factorial (RF) design that has 2 levels for

grouping strategy, 7 levels for granularity, and a varying number of techniques depending on the particular

experiment. Each design cell has nine observations, corresponding to each of the versions (after the base

version) from each program under each treatment combination. These versions constitute random effects

that we do not control, and we consider them samples from a population of program versions.

The choice of a factorial design was based on the power of analysis offered by its treatment combinations,

which lets us interpret not only the main factors but also their interactions. The incorporation of three factors

was aimed at decreasing the variability of the results by controlling more independent variables, while at the

same time increasing the generalizability of the results by observing various scenarios that might be present

in the real world. We analyze emp-server and bash separately to reduce the impact of program related

factors that we did not fully control (e.g. software evolution, differences in test suites) on the results.

From the standpoint of empirical methodologies, it is interesting to note that such a factorial design

is often avoided in other disciplines due to the costs of obtaining “subjects” for all possible combinations

of independent variables. Since our “subjects” were programs and we had automated a large part of the

experiment, we were able to gather the data necessary to comply with such a design. Still, given the effort

involved in preparing program versions (ranging, approximately, from 80 to 300 hours per version) we wanted

to detect meaningful effects with a minimal number of invested resources. We decided to conservatively

determine sample size by doubling the number of versions used in the first instantiation of this study [33]

where significance was detected for at least one of the factors we are studying here.

3.5 Threats to Validity

Any controlled experiment is subject to threats to validity, and these must be considered in order to assess

the meaning and impact of results (see [42] for a general discussion of validity evaluation and a threats

classification). In this section we describe the internal, external, construct, and conclusion threats to the

validity of these experiments, and the approaches we used to limit their impact.

3.5.1 Internal Validity

To test our hypotheses we had to conduct experiments requiring a large number of processes and tools.

Some of these processes (e.g., fault seeding) involved programmers and some of the tools were specifically

developed for the experiments, all of which could have added variability to our results increasing threats to

internal validity. We used several procedures to control these sources of variation. For example, the fault

seeding process was performed following a specification so that each programmer operated in a similar way,

and it was performed in two locations using different groups of programmers. Also, we validated new tools

18

by testing them on small sample programs and test suites, refining them as we targeted the larger programs,

and cross validating them across labs.

Having only one test suite for each test input grouping type at each granularity level per program is also a

potential threat to internal validity. Although the use of multiple test suites would have been preferable, the

expense of creating such suites was prohibitive. Our process for generating coarser granularity test suites,

however, involved randomly selecting and joining test grains, reducing the chances of bias caused by test

suite composition.

Our handling of masking effects, described in Section 3.2.2, might constitute a further threat to internal

validity; however, as noted there, our analysis suggests that such effects occur infrequently among the test

cases we utilized.

3.5.2 External Validity

Three issues affect the generalization of our results. The first issue is the quantity and quality of programs

studied. Although using only two programs lessens the external validity of the results, the relatively consistent

results we obtain for bash and emp-server suggest that the results may generalize. Further, we are able to

study a relatively large number of actual, sequential releases of these systems. Regarding program quality,

there is a large population of C programs of similar size. For example, the linux RedHat 7.1 distribution

includes source code for 394 applications; the average size of these applications is 22,104 non-comment lines

of code, and 19% have sizes between 25 and 75 KLOC, similar to the programs studied in our experiment.

Nevertheless, replication of these studies on other programs could increase the confidence in our results, and

help us investigate other factors.

The second issue involves fault representativeness. Our fault seeding process helped us control for threats

to internal validity that must be controlled in order to examine causal factors; however, faults and fault

patterns may differ in practice, and additional studies of additional fault populations are needed.

The third limiting factor is test process representativeness. Although the random and functional grouping

procedures we employed to obtain coarser granularity test suites are powerful in terms of control, they

constitute simulations of the testing procedures used in industry, and this might also impact the generalization

of our results. Complementing these controlled experiments with case studies on industrial test suites, though

sacrificing internal validity, could be helpful.

3.5.3 Construct Validity

The three dependent measures that we have considered are not the only possible measures of the costs and

benefits of regression testing methodologies. Our measures ignore the human costs that can be involved in

executing, auditing and managing test suites. Our measures do not consider debugging costs such as the

difficulty of fault localization, which could favor fine granularity test suites [20]. Our measures also ignore the

analysis time required to select or prioritize test cases, or reduce test suites. Previous work [34, 38, 39] has

shown, however, that for the techniques considered, either analysis time is much smaller than test execution

time, or analysis can be accomplished automatically in off-hours prior to the critical regression testing period

(thus, having no effect on cost-benefits).

19

3.5.4 Conclusion Validity

The number of programs and versions we considered was large enough to show significance for most of the

techniques we studied in most, but all, cases. Although the use of more versions would have increased the

power of the experiment, the average cost of preparing each version ranged from 80 to 300 hours, limiting

the cost-effectiveness of taking additional observations.

3.6 Data and Analysis

In the following sections we investigate the effects of test suite granularity and grouping strategy on our four

regression testing methodologies, in turn, employing descriptive and inferential statistics.

3.6.1 Retest-All

We begin by exploring the impact of test suite granularity and grouping strategy on the retest-all technique.

Figure 2 summarizes the fault detection effectiveness (leftmost pair of graphs) and test execution time

(rightmost pair of graphs) observed per program as granularity level increases, for both grouping strategies.

Each graph contains seven data points per program, with each point representing the average, across all nine

modified versions of the given program, of the metric (fault detection effectiveness or test execution time)

being graphed. We join the data points with lines to assist interpretation.

The leftmost pair of graphs show that the fault detection effectiveness of the test suites remained nearly

constant for both programs, independent of changes in granularity level or grouping strategy. In total, only

three cases occurred in which faults detected at lower granularity levels were lost at granularity level G32,

and only two cases occurred in which faults detected at lower granularity levels were lost at granularity level

G64 (too few to be visible in the graphs). The test suites for the programs, used in their entirety, were

almost always powerful enough — across all granularities and on all versions — to detect all of the faults in

the programs. We will have more to say about this in Sections 4.1 and 4.5, in our discussion of results.

The rightmost pair of graphs show that test execution time decreased as granularity level increased,

independent of grouping strategy or program. For example, under the random grouping strategy, from

granularity level G1 to granularity level G64, test execution time decreased, for bash, from 782 minutes to

222 minutes, and for emp-server, from 505 minutes to 26 minutes.

We formally investigated these tendencies relative to our hypotheses by performing an analysis of variance

(ANOVA) for each program. The presentation of the ANOVA results includes the sources of variation

considered, and for each program, the sum of squares, degrees of freedom, mean squares, F value, and

p-value for each source. Because we set alpha to 0.05, and the p-value represents the smallest level of

significance that would lead to the rejection of a null hypothesis, we reject an hypothesis when p is less

than alpha. The results (Table 4) are consistent for both programs, indicating that granularity level, but

not grouping strategy, significantly affected execution time. The data showed no evidence of significant

interactions between the independent variables for either program.

20

Random

1 2 4 8 16 32 640

20

40

60

80

100%

Fa

ults D

ete

cte

d

Functional

1 2 4 8 16 32 64

Random

1 2 4 8 16 32 640

200

400

600

800

1000

Exe

cu

tio

n T

ime

(m

inu

tes)

Functional

1 2 4 8 16 32 64

emp-server

bash

Figure 2: Fault detection effectiveness for random and functional grouping strategies (leftmost pair ofcolumns) and test execution time for random and functional grouping strategies (rightmost pair of columns)for the retest-all technique, across granularity levels (x-axis), averaged across versions.

Technique: retest-all

Variable: Test execution time.Emp-server Bash

Source SS DF MS F p SS DF MS F pGranularity 3268098 6 544683 5693.41 0.00 4635964 6 772661 19.13 0.00Grouping 199 1 199 2.08 0.15 131338 1 131338 3.25 0.07Granularity*Grouping 367 6 61 0.64 0.70 70586 6 11764 0.29 0.94Error 10715 112 96 4522624 112 40381

Table 4: Retest-all ANOVA.

3.6.2 Regression Test Selection

To facilitate the comparison of regression test selection techniques to each other and to the retest-all tech-

nique, we depict the data on these techniques together in Figure 3. The graphs in the first row present

results for the retest-all technique, and the other rows present results for the three RTS techniques.

21

rete

st-

all

Random

1 2 4 8 16 32 640

20

40

60

80

100

% F

aults D

ete

cte

d

Functional

1 2 4 8 16 32 64

Random

1 2 4 8 16 32 640

200

400

600

800

1000

Execution T

ime (

min

ute

s)

Functional

1 2 4 8 16 32 64

modifie

d e

ntity

Random

1 2 4 8 16 32 640

20

40

60

80

100

% F

aults D

ete

cte

d

Functional

1 2 4 8 16 32 64

Random

1 2 4 8 16 32 640

200

400

600

800

1000

Execution T

ime (

min

ute

s)

Functional

1 2 4 8 16 32 64

modifie

d n

on-c

ore

entity

Random

1 2 4 8 16 32 640

20

40

60

80

100

% F

aults D

ete

cte

d

Functional

1 2 4 8 16 32 64

Random

1 2 4 8 16 32 640

200

400

600

800

1000

Execution T

ime (

min

ute

s)

Functional

1 2 4 8 16 32 64

min

imiz

ation

Random

1 2 4 8 16 32 640

20

40

60

80

100

% F

aults D

ete

cte

d

Functional

1 2 4 8 16 32 64

Random

1 2 4 8 16 32 640

200

400

600

800

1000

Execution T

ime (

min

ute

s)

Functional

1 2 4 8 16 32 64

emp-server bash

Figure 3: Fault detection effectiveness for random and functional grouping strategies (leftmost pair ofcolumns) and test execution time for random and functional grouping strategies (rightmost pair of columns)for retest-all and RTS techniques, across granularity levels (x-axis), averaged across versions.

22

As the graphs indicate, the modified entity technique exhibited the same trends as the retest-all technique,

retaining fault detection effectiveness across granularity levels, and exhibiting a large reduction in the amount

of time required to re-execute the test suite as granularity level increased. The reason for this behavior is

that the location of changes in these particular program versions caused this safe RTS technique to require

execution of all existing test cases, because all test cases traversed code changed for the new version.

The modified non-core entity technique displayed different behavior. With this technique, for both

grouping strategies and at several granularity levels, faults were left undetected. For the random grouping

strategy, fault-detection effectiveness increased, from granularity level G1 to level G64, by approximately

14% for emp-server and 10% for bash. For the functional grouping strategy this same tendency occurred

for emp-server, but not for bash, for which fault-detection effectiveness varied widely across granularity

levels.

Fault-detection effectiveness results ran contrary to our intuitions; we had expected fault-detection ef-

fectiveness for bash to increase as granularity level increased, for the modified non-core entity technique,

because the technique excludes fewer test cases at higher granularity levels than at lower ones. Further

analysis of the data suggests that this difference between bash and emp-server arose due to differences in

the difficulties of exposing the faults in the programs. All but one of bash’s faults were exposed by fewer than

1% of that program’s granularity level 1 test cases, whereas only 23% of emp-server’s faults were exposed

by fewer than 1% of that program’s granularity level 1 test cases. We return to this issue in Section 4.

Test execution time with the modified non-core entity technique decreased as granularity level increased,

though by a smaller amount than occurred for the retest-all and modified-entity techniques. This difference

is due to the fact that the modified non-core entity technique selects fewer test cases at lower granularity

levels than at higher ones (in general, a given fine-granularity test case is less likely to encounter changes than

a given coarse-granularity test case.) For example, for emp-server under the random grouping strategy, the

modified non-core entity technique selected on average 35% of the test cases at granularity level G1, 68% at

level G4, and 96% at level G64.

We also observe that at granularity levels G2 through G8 on emp-server, and G4 through G32 on bash,

the functional grouping strategy appears to be associated with somewhat lower test execution times than the

random grouping strategy, for the modified non-core entity technique. These granularity levels are all levels

at which over 70% of the functional grouping test cases are homogeneous. The difference in performance

across grouping strategies can be attributed to the fact that homogeneous functional grouping test cases

are more likely than randomly grouped test cases to have similar code coverage characteristics. When code

modifications are limited, the number of test cases encountering those modifications (and thus the number

of test cases selected by a modified non-core entity RTS technique) will be less when the individual test

grains encountering modifications have been collected together into a few test cases, rather than distributed

randomly across many test cases.

Finally, the minimization RTS technique (fourth row of Figure 3) exhibited different behavior. First, we

observe greater variation in the percentage of faults detected with this technique than with the other RTS

techniques. Fault detection effectiveness also seems to have less consistently increasing and more variable

23

Random

1 2 4 8 16 32 640

20

40

60

80

100

Execution T

ime (

min

ute

s)

Functional

1 2 4 8 16 32 64

emp-server bash

Figure 4: Minimization RTS technique execution times averaged across all versions of each program.

tendencies for bash than for emp-server, arguably due to the larger number of relatively difficult-to-detect

faults in bash.

Considering test execution time for the minimization RTS technique, on both programs, differences in

execution time across granularity levels (and consequently, differences in the savings achievable through

minimization across granularity levels) were not large for emp-server, independent of grouping strategy.

However, trends in execution time differ between the two programs, as can be more clearly seen in Figure 4,

which presents the same data with the y-axis scale modified. For the random grouping strategy, emp-server

exhibits little difference in test execution time across granularity levels. On bash, however, test execution

time increases (for the random grouping strategy) from 18 minutes at granularity level G1, to 25 minutes

at level G8, and to 81 minutes at level G64. This difference is likely attributable to differences in coverage

achieved by the test cases for the two programs. Bash’s granularity level G1 test cases, on average, exercise

larger and more varied sets of functions than emp-server test cases, and as these are combined into coarser

granularity test cases, the opportunities for culling out redundancies among those test cases decrease more

rapidly. To summarize, it seems that increases in granularity level can have a negative effect on the ability of

the minimization RTS technique to provide savings, but these results also depend on the program or, more

directly, the coverage patterns achieved on that program by its test cases.

To formally determine whether the impact of test suite granularity and grouping strategy on our depen-

dent variables was statistically significant — corresponding to our first two hypotheses — we performed an

24

Techniques: modified non-core entity and retest-all

Variable: Fault-detection effectiveness.Emp-server Bash

Source SS DF MS F p SS DF MS F pGranularity 9 6 1 3.97 0.00 28 6 5 0.53 0.78Grouping 0 1 0 1.08 0.30 12 1 12 1.37 0.24Technique 8 1 8 20.87 0.00 70 1 70 8.00 0.01Granularity*Grouping 3 6 0 1.23 0.29 35 6 6 0.67 0.67Granularity*Technique 11 6 2 4.97 0.00 5 6 1 0.09 1.00Grouping*Technique 1 1 1 1.55 0.21 1 1 1 0.08 0.78Gran.*Group.*Tech. 1 6 0 0.40 0.88 3 6 0 0.05 1.00Error 82 224 0 1967 224 9Variable: Test execution time.

Emp-server BashSource SS DF MS F p SS DF MS F pGranularity 2851876 6 475313 236.02 0.00 2959544 6 493257 12.55 0.00Grouping 3944 1 3944 1.96 0.16 154812 1 154812 3.94 0.05Technique 402157 1 402157 199.70 0.00 1629161 1 1629161 41.44 0.00Granularity*Grouping 4837 6 806 0.40 0.88 119699 6 19950 0.51 0.80Granularity*Technique 758867 6 126478 62.80 0.00 1779040 6 296507 7.54 0.00Grouping*Technique 1835 1 1835 0.91 0.34 14175 1 14175 0.36 0.55Gran.*Group.*Tech. 2121 6 353 0.18 0.98 3533 6 589 0.02 1.00Error 451100 224 2014 8807271 224 39318

Table 5: Retest-all and modified non-core entity ANOVAs.

analysis of variance. The analysis considers all the factors utilized in the ANOVA for the retest-all technique,

and also incorporates technique as a new source of variation. Because the retest-all technique constitutes

the control technique for regression test selection, we paired it with each of the other RTS techniques to

determine whether those techniques’ effects on the dependent variable were significantly different than the

retest-all technique’s effect, and to determine whether the technique variable was more susceptible than oth-

ers to interactions with other sources of variation. (Because the data for the retest-all and modified-entity

techniques were nearly identical, we omit the comparison between these techniques.)

Table 5 presents the results of this analysis on each program applied to the retest-all and modified

non-core entity techniques. Considering fault detection effectiveness, the results on emp-server indicate

that granularity level and technique did have statistically significant impacts, matching our observations on

Figure 3. On bash, only technique exhibited significance; this was evident in Figure 3 where the retest-

all technique did not exhibit any variation for this dependent variable. Where the lack of significance for

granularity level on bash is concerned, the amount of variance in fault detection effectiveness across versions

could have limited our ability to detect a significant effect with the current number of observations in spite

of the tendencies observed in the graph.

We also found that for emp-server, though not for bash, the interaction between granularity level and

technique was significant, indicating that the impact of granularity level on fault detection effectiveness

differed depending on the technique utilized.

To better understand this interaction and to identify significant differences between means we performed

a Bonferroni multiple comparison analysis. This approach provides a post-hoc comparison of the effects’

means while controlling for the family-wise type of error. Table 6 presents the results of this analysis for all

combinations of granularity level and technique interaction, sorted by the mean fault-detection effectiveness

25

Emp-serverSource: Granularity * TechniqueDependent Variable: Fault Detection EffectivenessGranularity Technique Mean Homogeneous Groups1 modified non-core entity 8.67 A8 modified non-core entity 9.67 B2 modified non-core entity 9.72 B4 modified non-core entity 9.72 B16 modified non-core entity 9.83 B64 modified non-core entity 9.83 B32 modified non-core entity 9.83 B64 retest-all 9.83 B32 retest-all 9.89 B1 retest-all 10.00 B2 retest-all 10.00 B4 retest-all 10.00 B8 retest-all 10.00 B16 retest-all 10.00 B

Table 6: Bonferroni results: Emp-server, granularity * technique, fault detection effectiveness, modifiednon-core entity and retest-all.

of the combinations from smallest to largest. Combinations sharing a given letter in the “Homogeneous

Groups” column belong to the same statistically homogeneous group; combinations not sharing letters are

significantly different. Overall, the analysis shows that for the retest-all technique, changes in granularity

level did not impact fault detection, whereas for the modified non-core entity technique, at the lowest

granularity level, fault detection was significantly smaller than at higher levels.6

Returning to the ANOVAs (Table 5) to consider test execution time, on both programs the same three

factors — granularity level, technique, and their interaction — had statistically significant impact. This

validates our observations that different techniques appeared to be affected in different ways as granularity

level increased, with low granularity levels exposing greater differences between techniques. At higher gran-

ularity levels, reduced execution time savings for RTS techniques, and lower-cost coarser test cases, allowed

the retest-all technique to perform comparably to the RTS techniques. This analysis is confirmed by Bonfer-

roni analyses (Table 7). Results for bash show that at higher granularity levels both techniques performed

similarly, whereas at lower levels (G1 and G2) the retest-all technique was inferior. Further, the ANOVAs

did not reveal significance in the effects of grouping strategy, and thus did not support our observation about

the possible superiority of functional grouping over random in supporting lower test execution times.

Finally, Table 8 presents ANOVA results from the comparison of the retest-all and minimization RTS

techniques. The results on test execution time show significance for the same factors as in the analysis

of the modified non-core entity technique, for both programs. However, the results on fault detection

effectiveness show significance, for both programs, only for technique. As observed earlier, the minimization

RTS technique displayed a large amount of variation in fault detection effectiveness as granularity level

and grouping strategy changed. This variance could be attributable to the known influence of location and

magnitude of changes on the effectiveness of minimization techniques [8].6We include in the text only the Bonferroni results that contribute some new insight into the data. The remainder of

the tables for interaction analysis using Bonferroni, for all cases in which the ANOVAs showed that interaction effects weresignificant, are given in Appendix A.

26

BashSource: Granularity * TechniqueDependent Variable: Test Execution TimeGranularity Technique Mean Homogeneous Groups64 modified non-core entity 181.89 A16 modified non-core entity 189.72 A32 modified non-core entity 196.44 A64 retest-all 212.78 A8 modified non-core entity 216.06 A32 retest-all 230.67 A4 modified non-core entity 231.11 A16 retest-all 240.83 A1 modified non-core entity 252.00 A2 modified non-core entity 260.44 A8 retest-all 300.33 A B4 retest-all 366.44 A B2 retest-all 520.06 B1 retest-all 782.22 C

Table 7: Bonferroni results: Bash, granularity * technique, test execution time, modified non-core entity andretest-all.

Techniques: Minimization and Retest-all

Variable: Fault-detection effectiveness.Emp-server Bash

Source SS DF MS F p SS DF MS F pGranularity 13 6 2 1.21 0.30 50 6 8 1.00 0.43Grouping 1 1 1 0.65 0.42 18 1 18 2.22 0.14Technique 242 1 242 137.25 0.00 459 1 459 55.38 0.00Granularity*Grouping 6 6 1 0.59 0.74 30 6 5 0.60 0.73Granularity*Technique 18 6 3 1.72 0.12 18 6 3 0.37 0.90Grouping*Technique 1 1 1 0.81 0.37 0 1 0 0.00 1.00Gran.*Group.*Tech. 4 6 1 0.37 0.90 9 6 1 0.18 0.98Error 395 224 2 1855 224 8Variable: Test execution time.


Table 8: Retest-all and minimization ANOVAs.

3.6.3 Test Suite Reduction

To facilitate the comparison between the GHS reduction and retest-all techniques, we depict the data for

these techniques together in Figure 5. The graphs in the first row present results for the retest-all technique,

and the graphs in the second row present results for the GHS reduction technique.

As the graphs show, fault detection effectiveness results for GHS reduction were simillar to the results

observed for the modified non-core entity and minimization RTS techniques, in that reduction left faults

undetected for both grouping strategies and at most granularity levels. Again, bash fared worse than

27

rete

st-

all

Random

1 2 4 8 16 32 640

20

40

60

80

100%

Fa

ults D

ete

cte

d

Functional

1 2 4 8 16 32 64

Random

1 2 4 8 16 32 640

200

400

600

800

1000

Exe

cu

tio

n T

ime

(m

inu

tes)

Functional

1 2 4 8 16 32 64

GH

S r

ed

uctio

n

Random

1 2 4 8 16 32 640

20

40

60

80

100

% F

au

lts D

ete

cte

d

Functional

1 2 4 8 16 32 64

Random

1 2 4 8 16 32 640

200

400

600

800

1000

Exe

cu

tio

n T

ime

(m

inu

tes)

Functional

1 2 4 8 16 32 64

emp-server bash

Figure 5: Fault detection effectiveness for random and functional grouping strategies (columns one and two)and test execution time for random and functional grouping strategies (columns three and four) for theretest-all and test suite reduction techniques across test suite granularities (x-axis), averaged across versions.

emp-server, but not to the same extent as with the RTS techniques. Again, the overall trend is for effec-

tiveness to increase as granularity level increases, although this is more evident for the functional grouping

strategy (where we discount the decrease in effectiveness for emp-server at granularity levels above G8 due

to the non-homogeneity of its test cases at those levels.)

Test execution results for reduction were also similar to results for regression test selection. Test execution

time for emp-server under GHS reduction consistently decreased as granularity level increased, but at

a lower rate than for the control (retest-all) technique. This tendency did not hold, however, for bash,

where execution time increased as granularity level increased. As with the most aggressive RTS techniques,

reduction opportunities can be limited by coarser test cases, but this result varies with program and test

suite characteristics.

We performed an ANOVA to further evaluate these perceived differences and test our hypotheses. Table

9 presents the results for each of our programs and dependent measures. The results for emp-server

28

Techniques: Reduction and Retest-allVariable: Fault-detection effectiveness.

Emp-server BashSource SS DF MS F p SS DF MS F pGranularity 12.33 6 2 5.88 0.00 20.52 6 3 0.48 0.82Grouping 3.11 1 3 8.91 0.00 0.89 1 1 0.13 0.72Technique 24.14 1 24 69.14 0.00 28.67 1 29 4.04 0.05Granularity*Grouping 8.72 6 1 4.16 0.00 14.97 6 2 0.35 0.91Granularity*Technique 13.8 6 2 6.59 0.00 2.19 6 0 0.05 1.00Grouping*Technique 2.68 1 3 7.68 0.01 11.15 1 11 1.57 0.21Gran.*Group.*Tech. 8.26 6 1 3.94 0.00 3.71 6 1 0.09 1.00Error 78.22 224 0 1588.22 224 7Variable: Test execution time.


Table 9: Test Suite Reduction ANOVA.

Emp-serverSource: Grouping * TechniqueDependent Variable: Fault Detection EffectivenessGrouping Technique Mean Homogeneous GroupsRandom GHS reduction 9.13 AFunctional GHS reduction 9.56 BRandom retest-all 9.95 CFunctional retest-all 9.97 C

Table 10: Bonferroni results: Emp-server, grouping * technique, fault detection effectiveness, GHS reductionand retest-all.

on fault detection effectiveness of GHS reduction were somewhat surprising: all factors and interactions

were statistically significant. Based on our observations, we had expected granularity level, technique, and

their interaction to be significant. But here we also found an instance in which grouping strategy did

affect fault detection effectiveness. Analysis of interaction effects (see Table 10), however, indicate that the

impact of grouping occured only for the GHS reduction technique. Fault-detection effectivenes results for

bash, in contrast, indicate that for this program, only technique had a significant impact on fault detection

effectiveness. This difference between programs is likely due to the greater variability, for bash, in fault

detection capabilities of its reduced test suites: the overall standard deviation for the percentage of faults

detected for reduced test suites under bash was 22.6 whereas for emp-server it was 9.1.

On both programs, the effects of granularity level, technique and their interaction on test execution time

were statistically significant. This is similar to our findings for the modified non-core entity and minimization

RTS techniques. For emp-server, however, the interaction between grouping strategy and technique, and

grouping strategy, technique, and granularity level, were also significant with respect to test execution

time. This indicates that, although grouping strategy might not be a significant factor on its own, it can

29

Emp-serverSource: Grouping * TechniqueDependent Variable: Test Execution TimeGrouping Technique Mean Homogeneous GroupsRandom GHS reduction 25.52 AFunctional GHS reduction 28.96 BFunctional retest-all 154.89 CRandom retest-all 157.41 C

Table 11: Bonferroni results: Emp-server, grouping * technique, test execution time, GHS reduction andretest-all.

significantly impact the effect of the other factors. For example, Table 11 shows that for GHS reduction,

the mean execution time for functional grouping test suites was significantly larger than that for random

grouping test suites.

3.6.4 Test Case Prioritization

Our fourth experiment considered test case prioritization. Within this methodology we analyze three tech-

niques: optimal prioritization to provide an upper bound on performance, additional function coverage

prioritization, and additional function coverage prioritization incorporating change information. For brevity,

we use the shorter names “optimal”, “coverage”, and “diff-coverage” for these techniques, respectively.

Figure 6 displays three pairs of graphs, two per technique (one per test grouping strategy), with our

measure of rate of fault detection, APFD, on the y axes. Results for both programs appear similar under the

optimal technique, with a slow but consistent decrease in APFD as granularity level increased independent

of grouping strategy. Having more test cases appears to provide greater opportunities for prioritization;

still, the differences are small. The coverage prioritization technique also displayed a decrease in APFD as

granularity level increased. The rate of decrease was greater for this technique than for the optimal technique,

and more obvious for bash than for emp-server. Similar tendencies can be observed for the diff-coverage

technique, which incorporates modification information. These results confirm the observation that lower

levels of granularity enable more effective prioritization than higher levels.

For all techniques, APFD values for bash were lower than those for emp-server, and APFD values

for bash were more strongly affected by increases in granularity level than were results for emp-server.

This may be attributable to the somewhat more complex coverage characteristics of bash’s test suites: the

coverage achieved by individual test cases in that program’s suites (especially when grouped) varies less than

the coverage achieved by individual test cases for emp-server.

Finally, for the coverage and diff-coverage techniques, on bash and at higher levels of granularity, test

suites obtained using the random grouping strategy seem to have generated higher APFD values than test

suites obtained using the functional grouping strategy.

To helps us determine whether the differences observed in the graphs are statistically significant we

performed two ANOVAs, each considering two levels of the technique variable. The choice of technique

levels in each analysis was based on observations derived from exploratory analysis in which the optimal

technique served as a conservative estimate of a theoretical upper bound.

30

AP

FD

Random

1 2 4 8 16 32 640

10

20

30

40

50

60

70

80

90

100

Functional

1 2 4 8 16 32 64

Random

1 2 4 8 16 32 640

10

20

30

40

50

60

70

80

90

100

Functional

1 2 4 8 16 32 64

Random

1 2 4 8 16 32 640

10

20

30

40

50

60

70

80

90

100

Functional

1 2 4 8 16 32 64

Optimal Coverage Diff-Coverage

bash emp-server

Figure 6: APFD values for test case prioritization.

The first ANOVA (Table 12) focuses on the optimal and the coverage based prioritization techniques. For

both programs, granularity level and technique had a significant effect on the value of APFD. This means

that increasing granularity level resulted in significantly different APFD values, and that APFD values can

change significantly based on whether optimal or coverage prioritization techniques are utilized. Grouping

strategy was also a significant factor for bash.

On both programs, however, the significant main interaction effects are constrained by significant inter-

actions involving granularity level and technique. For example, Table 13 shows that for bash, under the

optimal prioritization technique, changes in granularity level did not have a significant impact on APFD

(all the means for optimal are in the same homogeneous group E), whereas for the coverage prioritization

techique, lower level granularities (under the homogeneous group E) generated significantly higher APFD

values than higher level granularities (e.g., granularity level G64 is under homogeneous group A). In addition,

31

Techniques: Optimal and CoverageVariable: APFD.


Table 12: Optimal and Coverage Prioritization ANOVAs.

BashSource: Granularity * TechniqueDependent Variable: APFDGranularity Technique Mean Homogeneous Groups64 Coverage 65.05 A32 Coverage 80.72 B16 Coverage 87.58 BC8 Coverage 90.50 CD4 Coverage 94.77 DE64 Optimal 96.09 DE2 Coverage 97.41 DE1 Coverage 97.60 E32 Optimal 97.65 E16 Optimal 98.63 E8 Optimal 99.30 E4 Optimal 99.64 E2 Optimal 99.82 E1 Optimal 99.90 E

Table 13: Bash, granularity * technique, APFD.

for bash there were significant interactions between grouping strategy and granularity level, and granular-

ity level, grouping strategy, and technique, and these further constrain the implications of the main effect

results.

The second ANOVA (Table 14) involves the optimal and diff-coverage techniques. The results follow the

significance patterns observed in the previous analysis but with fewer interactions (none for emp-server and

two for bash), which places fewers constraints on the main effects findings. Granularity level and technique

were significant factors for both programs. Grouping strategy was a significant factor for bash, but the high

level of interaction between grouping and technique prompted us to analyze this further. Table 15 presents

the Bonferroni test results on the interaction between grouping and technique, indicating that grouping has

a significant effect only for the diff-coverage technique (random and functional grouping strategies under

optimal belong to the same homogeneous group C).

32

Techniques: Optimal and Diff-CoverageVariable: APFD.


Table 14: Optimal and Diff-Coverage Prioritization ANOVAs.

BashSource: Grouping * TechniqueDependent Variable: APFDGrouping Technique Mean Homogeneous GroupsFunctional Diff-Coverage 77.30 ARandom Diff-Coverage 84.32 BFunctional Optimal 98.69 CRandom Optimal 98.74 C

Table 15: Bash, grouping * technique, APFD.

4 Discussion

We begin our discussion of results by summarizing the overall implications of the foregoing analyses for our

hypotheses. Tables 16 and 17 present summaries for emp-server and bash, respectively. The tables show,

for each analysis performed, for each source of variation and interaction considered, and for each dependent

variable of interest, whether that source of variation or interaction was statistically significant or not in

influencing that dependent variable. Asterisks denote significance and hyphens its absence. Blank entries

under the retest-all column are cases in which analyses did not apply (technique was not a source of variation

in these cases). The modified entity technique behaved the same as the retest-all technique and thus we

omit it from the tables.

With respect to hypothesis H1 (test suite granularity does not have a significant impact on the costs and

benefits of regression testing techniques), our results strongly support the alternative hypothesis. Test suite

granularity had a significant impact on the efficiency of regression testing (as measured by test execution

time) for retest-all, regression test selection, and test suite reduction methodologies: this result occurred in

all cases other than that of the modified entity technique, and was consistent across programs. Granularity

also significantly affected the rate of fault detection achieved by prioritization techniques; this result too was

consistent across programs. Finally, granularity did significantly impact the rate of fault detection achieved

through regression testing techniques, but this result occurred only for emp-server, under the modified

non-core entity RTS and GHS reduction techniques.

With respect to hypothesis H2 (test input grouping does not have a significant impact on the costs

and benefits of regression testing techniques), we are not able to unequivocally reject the null hypothesis.

33

retest-all selection reduction prioritization

mod’d noncore minimization GHS coverage diff-coveragevs retest-all vs retest-all vs retest-all vs optimal vs optimal

exec fde exec fde exec fde exec fde apfd apfd

granularity * - * * * - * * * *grouping - - - - - - - * - *technique * * * * * * * *gran*grp - - - - - - - * - *gran*tech * * * - * * * *grp*tech - - - - * * - -gran*grp*tech - - - - * * - *

Table 16: Summary of significant effects for emp-server. Columns headed “exec” pertain to execution time,and columns headed “fde” pertain to fault detection effectiveness. “*” entries indicate cases in which thesource of variation or interaction listed in column 1 was statistically significant, and “-” entries indicate caseswhere significance was not found.

retest-all selection reduction prioritization

mod’d noncore minimization GHS coverage diff-coveragevs retest-all vs retest-all vs retest-all vs optimal vs optimal

exec fde exec fde exec fde exec fde apfd apfd

granularity * - * - * - * - * *grouping - - - - - - - - - *technique * * * * * * * *gran*grp - - - - - - - - - -gran*tech * - * - * - - *grp*tech - - - - - - - *gran*grp*tech - - - - - - - -

Table 17: Summary of significant effects for bash. Columns headed “exec” pertain to execution time, andcolumns heade “fde” pertain to fault detection effectiveness. “*” entries indicate cases in which the source ofvariation or interaction listed in column 1 was statistically significant, and “-” entries indicate cases wheresignificance was not found.

Among the retest-all, regression test selection, and GHS reduction techniques, test input grouping exhibited

a significant effect in only one case: that of GHS reduction applied to emp-server. Test input grouping also

affected test case prioritization, but only for the diff-coverage technique.

With respect to hypothesis H3 (regression testing techniques do not perform significantly differently

in terms of the selected costs and benefits measures), results are consistent with those observed in earlier,

comparative studies of the techniques considered. For example, previous studies have shown that the modified

entity technique, applied at the function level, may achieve no savings [3], that test suite reduction can exhibit

varying degrees of fault detection effectivenes loss [38, 43], that tradeoffs between test execution time and

fault-detection effectiveness among non-safe RTS techniques such as those we have seen here exist [15], and

that the prioritization techniques [11] we examined here relate to one another in the way we have observed

here. In the context of this article, where our primary interest lies in observing the effects of test suite

granularity and test input grouping, this consistency with earlier results is important primarily because it

supports the conjecture that our results will generalize beyond the cases considered.

34

Finally, with respect to hypothesis H4 (test suite granularity and test input grouping effects across

regression testing techniques and programs do not significantly differ), we discovered frequent interaction

effects between technique and granularity, and our Bonferroni analyses further analyze these interactions.

In all but one case (coverage versus optimal prioritization on bash), a significant granularity effect was

accompanied by a significant granularity-technique interaction: the implication is that when granularity

has an effect, different regression testing techniques are affected differently than the control technique as

granularity level changes. Interactions with grouping are less frequent; in the few cases in which grouping

exhibited an impact, it also interacted with technique, and in one case (execution time for GHS reduction)

interaction effects were observed even though grouping alone had not had an impact.

Taken together, these results suggest that test suite granularity plays an important role in regression

testing cost-effectiveness – a role that merits attention by practitioners and further exploration by researchers.

The results further suggest that test input grouping may matter, but plays a less important role than

granularity. Moreover, the relative consistency of test suite granularity results across two quite different

test input groupings itself supports a conjecture that the test suite granularity results observed here may

generalize to other test input groupings.

Issues of the generality of these results can be more conclusively addressed only through replication of

these experiments on additional workloads. We can, however, suggest several further practical implications

of the results, and draw several additional observations on the data, as follows.

4.1 Implications for Common Practice (Retest-All)

The retest-all technique is arguably the most prevalently used regression testing technique in practice [30],

and is particularly appropriate when complete test suites can be executed, and their results validated, in

an amount of time considered reasonable by the testing organization (e.g., when fully automated test suites

have automated oracles and can run to completion overnight).

Our results show that the use of coarse granularity test suites can greatly increase efficiency for the

retest-all technique. For example, increasing granularity level from G1 to G4 on the emp-server test suite

saved an average of 365 minutes (a 72% reduction) in test execution time under retest-all. The same

granularity level increase on bash saved 415 minutes (a 53% reduction) in test execution time. Our results

also show that granularity need not adversely impact fault-detection effectiveness under retest-all: across all

our observations of the technique, only five cases occurred in which a fault detected at a lower granularity

level was not detected at a higher level when the entire test suite was executed.

The implication of these results, when coupled with their consistency across programs and test input

groupings, is that test engineers can safely harness granularity to increase the likelihood that they can afford

to use the retest-all technique.

This conclusion should, however, be qualified. The savings we observed in test suite execution time in our

experiments, using coarse granularity test suites, can be attributed primarily to reductions in the overhead

associated with test setup and cleanup. Our coarse granularity test suites apply just as many inputs as their

constituent test grains, but require less overhead in the number of setup and cleanup operations required.

35

Test suites in which test cases have lower overhead than these would be less conducive to providing practically

meaningful time savings through increases in granularity. For such suites other factors, such as the support

that fine granularity provides for prioritization or the greater simplicity of localizing faults uncovered by

small granularity test cases, may be of more value in establishing an appropriate granularity.

A second qualification concerns effects that may occur due to program complexity relative to input

size. The programs we have studied, like many programs, typically execute in time (rougly speaking) linear

in input size. For such programs, it is easy to envision why, in the case in which test cases incur some

startup costs, increased test suite granularity should lead to reduced execution time. Programs with higher

complexity relative to input size, however, such as programs that run in time quadratic in input size, may

display different relationships, because the increase in processing time incurred due to larger inputs may be

greater than the savings in execution time incurred due to coarser granularity.

It is also worth noting that, in our studies, the efficiency gains achieved by increasing granularity level

were greatest when starting from low granularities. For example, increasing granularity level from G1 to

G4 on emp-server saved 365 minutes in test execution time, but doubling granularity level further to G8

saved only 80 additional minutes, and doubling it again to G16 saved only 20 minutes more. As granularity

level increases, the returns achieved from further increases diminish, and this may allow factors other than

granularity to take on greater practical importance, above a certain granularity level, than further increases

in level of granularity. The results of our experiments, therefore, should not be interpreted to imply that the

most cost-effective granularity level in practice for a given test suite T is |T |.A final qualification concerns the effects of test oracle accuracy. In practice, regression testing oracles

range from those that exhaustively compare all components of program output to those that simply check

subsets of system state at specific checkpoints (as is often done, for example, by JUnit test cases through

embedded assertions). Intuitively, we expect this range of oracle rigour to contribute to both the fault

detection effectiveness of test suites, and the expense of executing those suites.

In our initial study of the effects of test suite granularity [33], when considering the retest-all technique

applied to a subset of the versions of bash and emp-server considered here, we found (in contrast to our

findings here) that coarser granularity test suites detected faults more effectively than finer granularity suites.

We surmised that these gains in fault detection effectiveness might be partially attributed to the execution,

by test cases in coarse granularity test suites, of additional code that causes data state changes occurring in

earlier stages of execution to be visible. We reasoned that fine granularity test suites could be more effective

if they were equipped with better oracles.

For these experiments, we improved the oracles used to validate the results of emp-server and bash test

cases; the improved techniques analyze additional output data beyond, and perform more precise differencing

of data than, the techniques used in the previous experiments. This resulted in an increase in the fault

detection effectiveness of our test cases and test suites to the levels observed here, where the fault detection

effectiveness of test suites under the retest-all technique was not significantly affected by test suite granularity.

These observations yield an interesting conclusion about test oracle design as it relates to test suite

granularity. Inaccurate oracles may be a greater liability than accurate ones for fault detection effectiveness

36

at low levels of granularity than at high levels. At high granularity levels, test cases apply more inputs and

have greater opportunities to detect discrepancies than at low levels, compensating for oracle limitations.

Together the results of our experiments suggest that, if practitioners equip their test cases with the right

observers, they should be less likely to perceive differences in fault detection effectiveness as granularity

increases than when they employ weaker oracles.

4.2 Implications for Regression Test Selection

An advantage of the retest-all technique is that it does not discard test cases that could reveal faults, and this

advantage was illustrated in our experiments. Nevertheless, the retest-all technique is not always a viable or

cost-effective option. Re-execution of the entire test suite may require more time than an organization can

spare, or require large amounts of expensive human effort (e.g., when validation is not automated) that could

be better spent on other tasks. In such cases, engineers may use regression test selection (RTS) techniques

to choose the test cases that are important for use in validating a particular system release.

Safe RTS techniques are guaranteed (under specific conditions) to not omit, in their selection, test cases

that can reveal faults, and have been shown to reduce regression testing time [7, 15, 34, 37]. In our particular

experiments, however, for the particular test suites and program versions we utilized, the (safe) modified

entity technique always selected all test cases, and provided no savings. We thus focus on the implications

of our results for non-safe RTS techniques.

Non-safe RTS techniques provide a wide range of efficiency/effectiveness tradeoffs, balancing the ben-

efits gained in test execution costs against the risks involved in losing fault detection effectiveness. Our

experiments considered both an aggressively selective technique (minimization RTS), and a less aggressively

selective technique (modified non-core entity RTS), and our results have several implications for each of these

approaches.

Considering test suite granularity first, and beginning with the less aggressive modified non-core entity

technique, finer granularity test suites are clearly more supportive of modified non-core entity regression test

selection than coarser granularity suites, since the ability of the technique to decrease regression testing time

decreased rapidly as granularity level increased. This tendency was evident for both programs. For example,

when the modified non-core entity technique was applied to the level G1 test suite for emp-server, that test

suite’s average execution time was reduced from 505 to 180 minutes (a 64% time reduction). When the same

technique was applied to the level G64 suite for emp-server, the average saving was less than 9%. Finer

granularity provides greater flexibility than coarse granularity, by promoting larger numbers of small test

cases that can be successfully manipulated by the technique to reduce the cost of test execution.

Our results also show that, even when organizations can afford to employ a retest-all technique using

test suites composed at some granularity level Gk, they may be able to save regression testing time by using

some lower level of granularity together with the modified non-core entity RTS technique, and gain any

other benefits (e.g., fault localization ease or prioritizability) that accrue for finer granularity test cases. For

example, on bash, modified non-core entity RTS applied to level G1 test suites resulted in more efficient

re-testing than retest-all applied to level G2 or G4 test suites. Whether such efficiency gains are worthwhile

37

depends, however, on whether the fault detection effectiveness loss that can accompany modified non-core

entity is acceptable to the testing organization.

When we turn to aggressively selective RTS techniques, represented here by minimization RTS, we find

much greater potential for savings, with much greater potential for fault detection loss. Here, however, the

effects of granularity are somewhat mixed: on one of our programs (emp-server), higher granularity levels

produced savings in test execution time, and on the other (bash) they reduced savings, and granularity level

had no effect on fault detection effectiveness. We suspect that the very aggressiveness of the minimization

RTS technique, which leads to relatively large test suite reductions, may cause granularity effects to assume

less influence on cost-effectiveness than other factors.

Considering test input grouping, we were unable to reject the null hypothesis (H2) for non-safe RTS

techniques for execution time or fault detection effectiveness loss. This suggests that our results with respect

to test suite granularity are to some degree robust over the test suite characteristics captured by our test

input grouping construct. An implication is that future empirical work in this area could focus, without loss

of internal validity, on studying the effects of test suite granularity and technique.

Finally, our data suggests that fault difficulty can influence granularity effects. As noted earlier, most

faults in bash were relatively difficult to expose, with 99% revealed by fewer than 1% of that program’s

level G1 test cases. In contrast, only 23% of the faults in emp-server were exposed by fewer than 1% of

that program’s level G1 test cases. This difference can be seen as responsible for two effects identified in

Section 3.6.2. First, with the modified non-core entity RTS technique, granularity significantly affected fault

detection effectiveness on emp-server, with higher granularity levels typically increasing effectiveness; but

this result did not occur for bash. Second, with the minimization RTS technique, fault detection effectiveness

varied more greatly across granularity levels for bash than for emp-server.

With respect to the increases in fault detection effectiveness that accompany granularity level increases

for emp-server, we expect that the “observer effect” previously mentioned is at least partially responsible

for these results. Test cases in coarser granularity test suites have somewhat greater fault detection abilities

than their counterparts in finer granularity suites due to increased opportunities for state and output changes

to be revealed. But these results suggest that fault difficulty plays a role in this effect.

With respect to the variance in fault detection effectiveness seen for bash, however, a different factor

emerges. Test cases that expose faults singly can fail to do so when composed with other test cases due to

interactions. When faults are detected by only a few test inputs in a fine granularity test suite, relatively

few interactions need occur in a coarse granularity suite composed of those test inputs to cause those faults

to go undetected there. When subsets of test suites are selected, the likelihood that difficult-to-detect faults

will go undetected in coarse granularity test suites increases further, because there are fewer opportunities

for including test cases in which fault-masking interactions do not occur. (This is further evident in the fact

that, for the retest-all technique, the effects of fault difficulty did not influence fault-detection effectiveness.)

If these results generalize, an implication that when practitioners utilize coarse granularity test suites,

they may expect that these suites will be relatively strong (compared to fine granularity suites) at revealing

relatively easy-to-detect faults, but relatively weak at revealing difficult-to-detect faults.

38

4.3 Implications for Test Suite Reduction

Test suite reduction and minimization RTS each seek test suite subsets that provide minimal coverage of

specific program entities (e.g. functions); they differ, however, in that reduction seeks minimal coverage of

all covered program entities, while minimization RTS seeks minimal coverage of covered modified entities.

The primary effect of this difference, in our study, is that test suite reduction results in larger test suites

than minimization RTS. This size difference results in greater fault detection effectiveness, and greater test

execution times, for test suites reduced by GHS reduction than for those selected by minimization RTS. In

cases where aggressive reduction in testing effort is needed, GHS reduction may be a more cost-effective

alternative than minimization RTS.

Where test suite granularity is concerned, GHS reduction shares most of the effects and implications seen

for minimization RTS. The effects of granularity on test execution time are somewhat mixed across programs,

with coarse granularity adversely impacting execution time for bash, but improving it for emp-server.

Fault detection effectiveness loss is greater for bash, with its relatively difficult-to-detect faults, than for

emp-server, but variance in detection is greater for bash. Thus, we cannot yet provide to practitioners,

based on our data, clear evidence that any particular choice of granularity level is generally most cost-effective

than other choices for reduction.

The one effect observed for GHS reduction that was not shared with minimization RTS involves the effects

of test input grouping observed on emp-server. On this program, test input grouping was a significant

factor for fault detection effectiveness; in particular, functional grouping yielded better and more consistent

fault detection effectiveness, for reduced test suites, than random grouping. This suggests that in at least

some cases, functional grouping test suites may be preferable to random suites for practitioners anticipating

applying test suite reduction. Because this effect did not occur on bash, we conjecture that it may not hold

with respect to relatively hard-to-detect faults. On the other hand, functional grouping did not adversely

affect results on bash, either, so its use may not carry risk.

4.4 Implications for Test Case Prioritization

Whereas the retest-all, regression test selection, and test suite reduction methodologies are essentially mu-

tually exclusive, test case prioritization can be applied in conjunction with these methodologies to order all

test cases, selected test cases, or reduced test suites. This has implications for our prioritization results.

For example, our results show that finer test suite granularity is likely to provide greater opportunities for

prioritization and support higher APFD values than coarser granularity. This occurs because when coarse

granularity test cases are decomposed into finer granularity ones, the scope of the effects of the average test

case (e.g., its coverage, or its relationship with changed code) decreases, allowing prioritization techniques

to more precisely discriminate between test cases. This provides additional impetus to engineers using the

retest-all technique to choose a middle ground in granularity if they care about rate of fault detection. It

also provides an additional argument for engineers using RTS techniques to use fine granularity test cases.

In particular, engineers employing safe RTS techniques may need some method for responding to cases

in which their techniques fail to reduce test suite size, as occurred for the modified entity technique in our

39

studies. One response involves falling back on prioritization as a technique for placing important test cases

early, facilitating faster detection of (and response to) faults. Fine granularity test suites facilitate this.

It is important to note, however, that these implications also vary with the difficulty of detecting the

faults that exist in the program under test. Programs for which the number of fault-exposing test cases is

large are less likely to suffer APFD losses from increases in granularity than programs for which the number

of fault-exposing test cases is small. This result was most evident in our data when considering the APFD

results for the coverage prioritization technique under the functional test input grouping. In this case, the

APFD for bash was reduced by 37 points as granularity level increased from G1 to G64, whereas the APFD

for emp-server (which had fewer difficult-to-detect faults) was reduced by only 6 points.

One further implication of this consideration pertains to testing processes, which are typically driven by

tradeoffs between the expense of testing and the desire to detect faults. Where rate of fault detection is

concerned, when running test cases during development (especially as in test-driven development processes,

or test-every-night processes) where initial, easier-to-find faults might be expected to be common, coarse-

grained test cases that run faster due to lower setup time requirements will be most cost-effective. When

running system tests at the end of development cycles, where the probabilities of individual test cases failing

are smaller and the testing interval may be somewhat longer, fine-grained test cases will be most cost-

effective. Test suite designers might do well, therefore, to build flexibility into their test suites, such that

the granularity of those suites can be adjusted to meet the needs of particular testing stages.

Finally, where our prioritization results are concerned, in all but one case considered (coverage versus

optimal on bash), significance in granularity was accompanied by significance in granularity-technique in-

teraction: when granularity has an effect, different techniques are affected differently as granularity level

changes. For practitioners, the implication of this is that, in judging the relative effectiveness of techniques,

it is not sufficient to consider just results of those techniques, granularity must also be considered. For

researchers, the implication of this is that, when experimenting with techniques, it is important to specify

the workload (test suite characteristics) being utilized.

4.5 The Effects of Granularity and Grouping on Fault Detection per Test Case

In the preceding analyses and discussion we focused on a measure of fault-detection-effectiveness relative to

test suites, or reduced or selected subsets of test suites. Under this measure, for the retest-all methodology

and using our improved oracle and failure detection tools, our test suites did not lose significant fault-

detection effectiveness as granularity increased or decreased. Reduced or selected subsets of test suites,

however, did lose fault-detection-effectiveness, and did exhibit fault-detection-effectiveness that varied at

different test suite granularities.

To investigate the cause of this difference further, we look further at our data, turning our attention away

from entire test suites or test suite subsets, and towards the fault-detection effectiveness of the individual

test cases that compose these suites and subsets. To do this, we consider each of our coarse-grained test

cases at each granularity level Gk, and investigate the fault-detection effectiveness of these test cases singly,

versus the fault-detection effectiveness of their constituent level G1 test cases. On this view, with respect to

40

a specific test case tGk at level Gk (k > 1), and its constituent set S(tGk), the set of all granularity level G1

test cases t1, t2, . . . , tk, used to construct it, and with respect to a particular fault f , four categories of test

cases exist:

1. Equal-omission: tGk fails to detect f , and each test case ti ∈ S(tGk) fails to detect f .

2. Detection-lost: tGk fails to detect f , but there exists at least one test case ti ∈ S(tGk) that detects f .

3. Detection-gained: tGk detects f , even though each test case ti ∈ S(tGk) fails to detect f .

4. Equal-detection: tGk detects f , and there exists at least one test case ti ∈ S(tGk) that detects f .

These categories can help us track whether the process of composing test cases into coarser test cases

causes gains or losses in fault-detection effectiveness at the level of individual Gk test cases, as opposed to

the level of entire test suites composed of Gk test cases.

Figure 7 uses a stacked-bar chart to depict the percentages of test cases in our test suites that fall into

each of the foregoing categories, for granularity levels G2 through G64. The chart on the left corresponds

to random groupings, and the chart on the right corresponds to functional groupings. In each chart, the

horizontal axis represents test suite granularity, and the vertical axis, scaled 0 through 100%, represents

percentages of the total number of test cases in a test suite. A pair of bars are shown together at each

granularity level; the first corresponds to bash and the second to emp-server. Each bar is a composite,

with constituent bars stacked over one another, representing the equal-omission, detection-lost, detection-

gained, and equal-detection categories, from top to bottom, respectively. The percentage of test cases in each

category, under each granularity level and grouping strategy, is averaged across all the faults and versions of

each program.

Consider the results for level G2 under the random test input grouping (the two leftmost bars in the

leftmost chart). The first bar corresponds to bash and shows that more than 99% of the G2 test cases for

bash were classified as equal-omission: the constituent test cases did not detect faults, and composing them

caused no change in fault detection. Although not discernable in the graph, only an average of 0.33% of

the G2 test cases for bash were classified as equal-detection, detecting one or more faults also detected by

constituents. No detection-lost test cases were found, and only 0.12% of the test cases were classified as

detection-gained. Results for emp-server for level G2 are similar except that almost 9% of the G2 test cases

for emp-server classified as equal-detection (reflecting the fact that emp-server had more relatively easy

to detect faults than bash).

Continuing in this manner of observation across granularity levels, we observe that for bash, the percent-

age of equal-detection test cases increases consistently as granularity level increases. We also observe that

the percentage of detection-gained test cases for bash increases with granularity level from approximately

1% at level G16 to about 3% at level G64. Detection-lost cases are noticeable only at level G64, where on

average less than one test case masks a fault. For emp-server, however, detection-lost cases outnumber

detection-gained cases, especially at levels G16, G32, and G64.

41

Random Functional

bashempire-server

Equal-omission

Detection-lostDetection-gained

Equal-detection

Bars show Means

G2 G4 G8 G16 G32 G64Test Suite Granularity

0%

20%

40%

60%

80%

100%

Per

cent

age

of T

est C

ases

G2 G4 G8 G16 G32 G64Test Suite Granularity

Figure 7: Fault-detection effectiveness effects at the individual test case level.

In the case of functional groupings (the rightmost chart in Figure 7) we observe a difference in the

percentage of equal-omission test cases when compared with random grouping. This is clearly noticeable

for emp-server from levels G2 through G16. In other words, although the test suite’s overall effectiveness

remained the same across groupings, fewer test cases revealed faults. Bash under functional grouping exhibits

a slightly larger percentage of detection-lost test cases at levels G32 and G64 in relation to random grouping.

Correspondingly, the percentage of detection-gained test cases at these levels for bash is smaller for functional

grouping than for random grouping. Emp-server exhibits an opposite trend, with slightly reduced detection-

lost test cases for functional grouping than random grouping at the G32 and G64 granularity levels.

Overall, increases in granularity level are associated with increases in the percentages of both detection-

lost and detection-gained test cases. Further, at lower granularities, functional grouping test suites have a

smaller percentage of equal-detection test cases.

The significance of this discussion lies partly in its ability to help explain our fault-detection effectiveness

results for retest-all, regression test selection and test case reduction. The test cases in our test suites

42

are collectively strong enough to reveal all of the faults in our programs. When test inputs are composed

into coarser granularity test cases, the fault-revealing capabilities of many individual test cases change, but

situations in which detection power is lost by some test cases are compensated for by other test cases.

This held true both for emp-server with its somewhat more frequently detected faults, and for bash with

its somewhat less frequently detected faults, for the retest-all technique. For that technique (or for safe

RTS techniques in general), fault-detection effectiveness at the test suite level is what matters, and in our

experiments, granularity effects did not significantly affect fault-detection effectiveness at that level.

When considering regression testing techniques that select from among test cases (regression test selection

and test suite reduction), or test case prioritization techniques that evaluate results relative to individual test

cases, the situation changes. Here, the potential for test case granularity to alter fault-detection-effectiveness

at the individual test case level takes on greater importance, because as the number of test cases composing a

test suite is reduced, the importance of individual test cases relative to the entire suite increases. This factor

likely contributes to the cases in which our selective methodologies and test case prioritization techniques

exhibit significant effects in fault detection effectiveness as test suite granularity varies.

5 Conclusion

Writers of testing textbooks have long shown awareness that the composition of test suites can affect the

cost-effectiveness of testing. These effects can begin when testing the initial release of a system, where success

in finding faults in that release, as well as the amount of testing that can be accomplished, can vary based on

test suite granularity and test input grouping. Software that succeeds, however, subsequently evolves: the

costs of testing that software are compounded over its lifecycle, and the opportunity to miss faults through

inadequate regression testing occurs with each new release. It is thus imperative that researchers study the

effects of test suite design across the entire software lifecycle.

Several test suite design factors, such as test suite size and adequacy criteria, have been empirically

studied, but few have been studied with respect to evolving software. Several regression testing methodologies

have been empirically studied, but few with respect to issues in test suite design. This article brings the

empirical study of test suite design and regression testing methodologies together, focusing on two particular

design factors: test suite granularity and test input grouping. Our results highlight several cost-benefits

tradeoffs associated with these factors, and related to regression testing techniques and processes.

Empirical studies such as those that we have described here can provide evidence for or against hypotheses

such as those we have investigated, but cannot prove them. Instead, validity concerns must be addressed

by additional studies using different programs and other artifacts, alternative measures, and alternative

methodologies. Only through such repetition can a body of evidence be built on behalf of such hypotheses,

rendering results more general. This work lays the groundwork for such further studies.

43

ACKNOWLEDGEMENTS

This work was supported by the NSF Information Technology Research program under Awards CCR-0080898

and CCR-0080900 to University of Nebraska, Lincoln and Oregon State University, and by NSF Awards CCR-

9703108 and CCR-9707792 to Oregon State University. We thank Satya Kanduri and Srikanth Karre for

helping prepare the emp-server and bash subjects.

References

[1] J. Bach. Useful features of a test automation system (part iii). Testing Techniques Newsletter, October

1996.

[2] B. Beizer. Black-Box Testing. John Wiley and Sons, New York, NY, 1995.

[3] J. Bible, G. Rothermel, and D. Rosenblum. Coarse- and fine-grained safe regression test selection. ACM

Transactions on Software Engineering and Methodology, 10(2):149–183, April 2001.

[4] R. Binder. Testing Object-Oriented Systems. Addison Wesley, Reading, MA, 2000.

[5] D. Binkley. Semantics guided regression test cost reduction. IEEE Transactions on Software Engineer-

ing, 23(8), August 1997.

[6] T.Y. Chen and M.F. Lau. Dividing strategies for the optimization of a test suite. Information Processing

Letters, 60(3):135–141, March 1996.

[7] Y.F. Chen, D.S. Rosenblum, and K.P. Vo. TestTube: A system for selective regression testing. In

Proceedings of the 16th International Conference on Software Engineering, pages 211–220, May 1994.

[8] S. Elbaum, K. Kallakuri, A. G. Malishevsky, G. Rothermel, and S. Kanduri. Understanding the Effects

of Changes on the Cost-Effectiveness of Regression Testing Techniques. Journal of Software Testing,

Verification, and Reliability, 13(2):–, June 2003.

[9] S. Elbaum, A. Malishevsky, and G. Rothermel. Prioritizing test cases for regression testing. In Proceed-

ings of the International Symposium Software Testing and Analysis, pages 102–112, August 2000.

[10] S. Elbaum, A. Malishevsky, and G. Rothermel. Incorporating varying test costs and fault severities into

test case prioritization. In Proceedings of the 23rd International Conference on Software Engineering,

pages 329–338, May 2001.

[11] S. Elbaum, A. G. Malishevsky, and G. Rothermel. Test case prioritization: A family of empirical studies.

IEEE Transactions on Software Engineering, 28(2):159–182, February 2002.

[12] S. Elbaum, J. Munson, and M. Harrison. CLIC: A tool for the measurement of software system dynamics.

In SETL Technical Report - TR-98-04., 04 1998.

44

[13] S. Elbaum, G. Rothermel, S. Kanduri, and A. G. Malishevsky. Selecting a cost-effective test case

prioritization techniques. Technical Report 03-01-01, University of Nebraska - Lincoln, January 2003.

[14] K.F. Fischer, F. Raji, and A. Chruscicki. A methodology for retesting modified software. In Proceedings

of the Nat’l. Tele. Conference B-6-3, pages 1–6, November 1981.

[15] T.L. Graves, M.J. Harrold, J-M Kim, A. Porter, and G. Rothermel. An empirical study of regression

test selection techniques. In Proceedings of the 20th International Conference on Software Engineering,

pages 188–197, April 1998.

[16] R. Gupta, M.J. Harrold, and M.L. Soffa. An approach to regression testing using slicing. In Proceedings

of the Conference on Software Maintenance, pages 299–308, November 1992.

[17] M. J. Harrold, R. Gupta, and M. L. Soffa. A methodology for controlling the size of a test suite. ACM

Transactions on Software Engineering and Methodology, 2(3):270–285, July 1993.

[18] M.J. Harrold and G. Rothermel. Aristotle: A system for research on and development of program

analysis based tools. Technical Report OSU-CISRC- 3/97-TR17, Ohio State University, Mar 1997.

[19] J. Hartmann and D.J. Robson. Revalidation during the software maintenance phase. In Proceedings of

the Conference on Software Maintenance, pages 70–79, October 1989.

[20] R. Hildebrandt and A. Zeller. Minimizing failure-inducing input. In Proceedings of the International

Symposium on Software Testing and Analysis, pages 135–145, August 2000.

[21] M. Hutchins, H. Foster, T. Goradia, and T. Ostrand. Experiments on the effectiveness of dataflow- and

controlflow-based test adequacy criteria. In Proceedings of the International Conference on Software

Engineering, pages 191–200, May 1994.

[22] J. A. Jones and M. J. Harrold. Test-suite reduction and prioritization for modified condition/decision

coverage. In Proceedings of the International Conference on Software Maintenance, pages 92–101, Oc-

tober 2001.

[23] C. Kaner, J. Falk, and H. Q. Nguyeen. Testing Computer Software. Wiley and Sons, New York, 1999.

[24] J-M Kim, A. Porter, and G. Rothermel. An empirical study of regression test application frequency. In

Proceedings of the 22nd International Conference on Software Engineering, pages 126–135, June 2000.

[25] E. Kit. Software Testing in the Real World. Addison-Wesley, Reading, MA, 1995.

[26] H.K.N. Leung and L. White. Insights into regression testing. In Proceedings of the Conference on

Software Maintenance, pages 60–69, October 1989.

[27] H.K.N. Leung and L.J. White. A study of integration testing and software regression at the integration

level. In Proceedings of the Conference on Software Maintenance, pages 290–300, November 1990.

45

[28] D. Libes. Exploring Expect: A Tcl-Based Toolkit for Automating Interactive Programs. O’Reilly &

Associates, Inc., Sebastopol, CA, November 1996.

[29] J. Offutt, J. Pan, and J. M. Voas. Procedures for reducing the size of coverage-based test sets. In

Proceedings of the Twelfth International Conference on Testing Computer Software, pages 111–123,

June 1995.

[30] K. Onoma, W-T. Tsai, M. Poonawala, and H. Suganuma. Regression testing in an industrial environ-

ment. Communications of the ACM, 41(5):81–86, May 1988.

[31] T.J. Ostrand and M.J. Balcer. The category-partition method for specifying and generating functional

tests. Communications of the ACM, 31(6), June 1988.

[32] C. Ramey and B. Fox. Bash Reference Manual. O’ReillyO’Reilly & Associates, Inc., Sebastopol, CA,

2.2 edition, 1998.

[33] G. Rothermel, S. Elbaum, A. Malishevsky, P. Kallakuri, and B. Davia. The impact of test suite

granularity on the cost-effectiveness of regression testing. In Proceedings of the International Conference

on Software Engineering, May 2002.

[34] G. Rothermel and M. J. Harrold. Empirical studies of a safe regression test selection technique. IEEE

Transactions on Software Engineering, 24(6):401–419, June 1998.

[35] G. Rothermel and M.J. Harrold. Analyzing regression test selection techniques. IEEE Transactions on

Software Engineering, 22(8):529–551, August 1996.

[36] G. Rothermel and M.J. Harrold. A safe, efficient regression test selection technique. ACM Transactions

on Software Engineering and Methodology, 6(2):173–210, April 1997.

[37] G. Rothermel, M.J. Harrold, and J. Dedhia. Regression test selection for C++ programs. Journal of

Software Testing, Verification, and Reliability, 10(2), June 2000.

[38] G. Rothermel, M.J. Harrold, J. Ostrin, and C. Hong. An empirical study of the effects of minimization

on the fault detection capabilities of test suites. In Proceedings of the International Conference on

Software Maintenance, pages 34–43, November 1998.

[39] G. Rothermel, R. Untch, C. Chu, and M.J. Harrold. Test case prioritization. IEEE Transactions on

Software Engineering, October 2001.

[40] A. Srivastava and J. Thiagarajan. Effectively Prioritizing Tests in Development Environment. In

Proceedings of the International Symposium on Software Testing and Analysis, July 2002.

[41] F. I. Vokolos and P. G. Frankl. Pythia: a regression test selection tool based on textual differencing.

In Proceedings of the 3rd International Conference on Rel., Quality & Safety of Software-Intensive Sys.

(ENCRESS ’97), May 1997.

46

[42] C. Wohlin, P. Runeson, M. Host, B. Regnell, and A. Wesslen. Experimentation in Software Engineering.

Kluwer Academic Publishers, Boston, MA, 2000.

[43] W. E. Wong, J. R. Horgan, S. London, and A. P. Mathur. Effect of test set minimization on fault

detection effectiveness. Software Practice and Experience, 28(4):347–369, April 1998.

[44] W.E. Wong, J.R. Horgan, S. London, and H. Agrawal. A study of effective regression testing in practice.

In Proceedings of the Eighth International Symposium on Software Reliability Engineering, pages 230–

238, November 1997.

47

Appendix A: Additional Analyses of Significant Interactions

A.1 Regression Test Selection

Emp-serverSource: Granularity * TechniqueDependent Variable: Test Execution TimeGranularity Technique Mean Homogeneous Groups64 modified non-core entity 24.11 A64 retest-all 25.65 A32 modified non-core entity 32.17 A B32 retest-all 34.24 A B16 modified non-core entity 41.82 A B16 retest-all 49.29 A B8 modified non-core entity 55.04 A B4 modified non-core entity 80.21 B C8 retest-all 80.47 B C2 modified non-core entity 120.37 C D4 retest-all 139.84 D E1 modified non-core entity 180.06 E2 retest-all 258.41 F1 retest-all 505.15 G

Table 18: Emp-server, granularity * technique, test execution time, modified non-core entity and retest-all.

Emp-ServerSource: Granularity * TechniqueDependent Variable: Test Execution TimeGranularity Technique Mean Homogeneous Groups16 minimization 6.37 A32 minimization 6.70 A64 minimization 6.91 A8 minimization 7.52 A4 minimization 8.74 A2 minimization 8.97 A1 minimization 9.62 A64 retest-all 25.65 B32 retest-all 34.24 B16 retest-all 49.29 C8 retest-all 80.47 D4 retest-all 139.84 E2 retest-all 258.41 F1 retest-all 505.15 G

Table 19: Emp-Server, granularity * technique, test execution time, minimization and retest-all.

48

BashSource: Granularity * TechniqueDependent Variable: Test Execution TimeGranularity Technique Mean Homogeneous Groups2 minimization 16.33 A1 minimization 18.44 A4 minimization 26.00 A16 minimization 39.56 A8 minimization 43.17 A B32 minimization 69.44 A B C64 minimization 72.22 A B C D64 retest-all 212.78 B C D E32 retest-all 230.67 C D E16 retest-all 240.83 D E8 retest-all 300.33 E4 retest-all 366.44 E F2 retest-all 520.06 F1 retest-all 782.22 G

Table 20: Bash, granularity * technique, test execution time, minimization and retest-all.

A.2 Test Suite Reduction

Emp-serverSource: Granularity * GroupingDependent Variable: Fault Detection EffectivenessGranularity Grouping Mean Homogeneous Groups8 Random 9.06 A2 Random 9.28 A B1 Random 9.28 A B1 Functional 9.28 A B64 Random 9.56 A B C32 Functional 9.56 A B C16 Random 9.72 A B C64 Functional 9.72 A B C2 Functional 9.83 B C32 Random 9.89 B C8 Functional 9.94 B C4 Functional 10.00 C4 Random 10.00 C16 Functional 10.00 C

Table 21: Emp-server, granularity * grouping, fault detection effectiveness.

49

Emp-serverSource: Granularity * TechniqueDependent Variable: Fault Detection EffectivenessGranularity Technique Mean Homogeneous Groups1 GHS reduction 8.56 A8 GHS reduction 9.00 A B2 GHS reduction 9.11 A B C64 GHS reduction 9.44 B C D32 GHS reduction 9.56 B C D16 GHS reduction 9.72 C D64 retest-all 9.83 D32 retest-all 9.89 D16 retest-all 10.00 D2 retest-all 10.00 D4 retest-all 10.00 D4 GHS reduction 10.00 D1 retest-all 10.00 D8 retest-all 10.00 D

Table 22: Emp-server, granularity * technique, fault detection effectiveness, GHS reduction and retest-all.

Emp-serverSource: Granularity * Grouping * TechniqueDependent Variable: Fault Detection EffectivenessGranularity Grouping Technique Mean Homogeneous Groups8 Random GHS reduction 8.11 A2 Random GHS reduction 8.56 A B1 Random GHS reduction 8.56 A B1 Functional GHS reduction 8.56 A B32 Functional GHS reduction 9.33 B C64 Functional GHS reduction 9.44 B C64 Random GHS reduction 9.44 B C16 Random GHS reduction 9.44 B C64 Random retest-all 9.67 C2 Functional GHS reduction 9.67 C32 Functional retest-all 9.78 C32 Random GHS reduction 9.78 C8 Functional GHS reduction 9.89 C4 1 retest-all 10.00 C4 Functional retest-all 10.00 C8 Functional retest-all 10.00 C4 Functional GHS reduction 10.00 C16 Random retest-all 10.00 C16 Functional GHS reduction 10.00 C16 Functional retest-all 10.00 C8 Random retest-all 10.00 C32 Random retest-all 10.00 C4 Random GHS reduction 10.00 C2 Functional retest-all 10.00 C2 Random retest-all 10.00 C1 Functional retest-all 10.00 C1 Random retest-all 10.00 C64 Functional retest-all 10.00 C

Table 23: Emp-server, granularity * grouping * technique, fault detection effectiveness, GHS reduction andretest-all.

50

Emp-serverSource: Granularity * TechniqueDependent Variable: Test Execution TimeGranularity Technique Mean Homogeneous Groups64 GHS reduction 16.17 A32 GHS reduction 17.53 A B16 GHS reduction 18.24 A B8 GHS reduction 23.20 A B C64 retest-all 25.65 B C4 GHS reduction 30.06 C D32 retest-all 34.24 D2 GHS reduction 36.97 D1 GHS reduction 48.55 E16 retest-all 49.29 E8 retest-all 80.47 F4 retest-all 139.84 G2 retest-all 258.41 H1 retest-all 505.15 I

Table 24: Emp-server, granularity * technique, test execution time, GHS reduction and retest-all.

Emp-serverSource: Granularity * Grouping * TechniqueDependent Variable: Test Execution TimeGranularity Grouping Technique Mean Homogeneous Groups64 Random GHS reduction 14.89 A32 2 GHS reduction 15.94 A64 Functional GHS reduction 17.45 A16 Functional GHS reduction 18.08 A16 Random GHS reduction 18.41 A32 Random GHS reduction 19.11 A8 Random GHS reduction 20.44 A B4 Random GHS reduction 24.68 A B C64 Random retest-all 25.62 A B C64 Functional retest-all 25.68 A B C8 Functional GHS reduction 25.96 A B C2 Random GHS reduction 32.60 B C D32 Functional retest-all 33.92 C D32 Random retest-all 34.56 C D4 Functional GHS reduction 35.44 C D2 Functional GHS reduction 41.33 D E16 Functional retest-all 48.53 E1 Random GHS reduction 48.55 E1 Functional GHS reduction 48.55 E16 Random retest-all 50.06 E8 Functional retest-all 80.22 F8 Random retest-all 80.72 F4 Functional retest-all 137.22 H4 Random retest-all 142.46 H2 Functional retest-all 253.53 I2 Random retest-all 263.30 I1 Random retest-all 505.15 J1 Functional retest-all 505.15 J

Table 25: Emp-server, granularity * grouping * technique, test execution time, GHS reduction and retest-all.

51

BashSource: Granularity * TechniqueDependent Variable: Test Execution TimeGranularity Technique Mean Homogeneous Groups1 GHS reduction 68.44 A2 GHS reduction 97.67 A16 GHS reduction 131.56 A B4 GHS reduction 141.44 A B8 GHS reduction 160.33 A B32 GHS reduction 199.94 A B C64 GHS reduction 201.22 A B C64 retest-all 212.78 A B C32 retest-all 230.67 A B C16 retest-all 240.83 A B C8 retest-all 300.33 B C4 retest-all 366.44 C D2 retest-all 520.06 D1 retest-all 782.22 E

Table 26: Bash, granularity * technique, test execution time, GHS reduction and retest-all.

A.3 Test Case Prioritization

Emp-serverSource: Granularity * TechniqueDependent Variable: APFDGranularity Technique Mean Homogeneous Groups64 Coverage 92.22 A8 Coverage 94.79 AB16 Coverage 94.93 AB32 Coverage 96.12 ABC1 Coverage 96.70 BCD4 Coverage 97.87 CDE64 Optimal 98.41 CDE2 Coverage 99.04 DE32 Optimal 99.12 DE16 Optimal 99.60 E8 Optimal 99.79 E4 Optimal 99.89 E2 Optimal 99.94 E1 Optimal 99.97 E

Table 27: Emp-server, granularity * technique, APFD, Optimal and Coverage.

BashSource: Grouping * TechniqueDependent Variable: APFDGrouping Technique Mean Homogeneous GroupsFunctional Additional 86.13 ARandom Additional 89.20 BFunctional Optimal 98.69 CRandom Optimal 98.74 C

Table 28: Bash, grouping * technique, APFD, Optimal and Coverage.

52

Emp-serverSource: Granularity * Grouping * TechniqueDependent Variable: APFDGranularity Grouping Technique Mean Homogeneous Groups64 Functional Coverage 90.59 A8 Random Coverage 92.90 AB64 Random Coverage 93.86 ABC16 Functional Coverage 94.45 ABCD16 Random Coverage 95.40 BCDE1 Random Coverage 95.48 BCDEF32 Functional Coverage 95.90 BCDEFG32 Random Coverage 96.33 BCDEFG8 Functional Coverage 96.68 BCDEFG4 Random Coverage 97.79 CDEFG1 Functional Coverage 97.91 CDEFG4 Functional Coverage 97.95 CDEFG64 Random Optimal 98.38 DEFG64 Functional Optimal 98.43 DEFG2 Functional Coverage 98.65 DEFG32 Functional Optimal 99.04 EFG32 Random Optimal 99.20 EFG2 Random Coverage 99.43 EFG16 Functional Optimal 99.60 EFG16 Random Optimal 99.60 EFG8 Random Optimal 99.79 FG8 Functional Optimal 99.79 FG4 Random Optimal 99.89 G4 Functional Optimal 99.89 G2 Functional Optimal 99.94 G2 Random Optimal 99.94 G1 Functional Optimal 99.97 G1 Random Optimal 99.97 G

Table 29: Emp-server, granularity * grouping * technique, APFD, Optimal and Coverage.

BashSource: Granularity * GroupingDependent Variable: APFDGranularity Grouping Mean Homogeneous Groups64 Functional 77.88 A32 Functional 83.10 A64 Random 83.26 A16 Random 91.76 B8 Random 92.88 BC16 Functional 94.44 BC32 Random 95.28 BC8 Functional 96.92 BC4 Random 97.00 BC4 Functional 97.41 BC2 Functional 98.54 BC1 Functional 98.57 BC2 Random 98.68 BC1 Random 98.93 C

Table 30: Bash, granularity * grouping, APFD, Optimal and Coverage.

53

BashSource: Granularity * Grouping * TechniqueDependent Variable: APFDGranularity Grouping Technique Mean Homogeneous Groups64 Functional Additional 59.88 A32 Functional Additional 68.42 A64 Random Additional 70.22 A16 Random Additional 84.86 B8 Random Additional 86.43 BC16 Functional Additional 90.30 BCD32 Random Additional 93.02 BCD4 Random Additional 94.34 BCD8 Functional Additional 94.58 BCD4 Functional Additional 95.19 BCD64 Functional Optimal 95.88 CD64 Random Optimal 96.30 CD1 Functional Additional 97.23 CD2 Functional Additional 97.28 CD32 Random Optimal 97.53 D2 Random Additional 97.54 D32 Functional Optimal 97.77 D1 Random Additional 97.96 D16 Functional Optimal 98.58 D16 Random Optimal 98.67 D8 Functional Optimal 99.27 D8 Random Optimal 99.32 D4 Functional Optimal 99.63 D4 Random Optimal 99.65 D2 Functional Optimal 99.81 D2 Random Optimal 99.82 D1 Functional Optimal 99.90 D1 Random Optimal 99.90 D

Table 31: Bash, granularity * grouping * technique, APFD, Optimal and Coverage.

BashSource: Granularity * TechniqueDependent Variable: APFDGranularity Technique Mean Homogeneous Groups64 Diff-Coverage 66.97 A32 Diff-Coverage 75.66 AB4 Diff-Coverage 80.51 ABC16 Diff-Coverage 80.65 ABC8 Diff-Coverage 83.17 BCD1 Diff-Coverage 86.27 BCDE2 Diff-Coverage 92.44 CDE64 Optimal 96.09 DE32 Optimal 97.65 E16 Optimal 98.63 E8 Optimal 99.30 E4 Optimal 99.64 E2 Optimal 99.82 E1 Optimal 99.90 E

Table 32: Bash, granularity * technique, APFD, Optimal and Diff-Coverage.

54

On Test Suite Composition and Cost-Eﬁective...

Documents

Transcript of On Test Suite Composition and Cost-Eﬁective...