A Formal Interpretation of Software Testing as Inductive Inference



SOFTWARE TESTING, VERIFICATION AND RELIABILITY, VOL. 6, 3-31 (1996)


HONG ZHU
Institute of Computer Software, Nanjing University, Nanjing 210093, People's Republic of China

SUMMARY

Software testing can be viewed as an inductive inference process during which the tester attempts to deduce software properties from its behaviour on a finite number of test cases. This paper investigates the foundation of software testing by interpreting the axioms of test adequacy criteria as properties of inductive inference. The interpretation manifests the conservative and simplest hypothesis nature of the induction underlying software testing. It also yields results relating adequate testing to software correctness and reliability. The convergence of the inductive inference process is proved to be a condition of the correctness of tested software. By measuring the convergence according to the probability of an inference result being correct up to a given error rate, a new approach to software reliability estimation is proposed, which differs from existing ones in taking software complexity into account.

KEY WORDS software testing; adequacy criteria; axioms; inductive inference; models of axiom systems; software reliability; program correctness

1. MOTIVATION

A fundamental question of software testing is whether a tested software system is correct. Attempts to answer this question have been made by a number of computer scientists. In the mid 1970s, Goodenough and Gerhart (1975, 1977) introduced the notion of software test adequacy criteria, which are rules to decide whether a test has been adequately performed. Goodenough and Gerhart required adequacy criteria to be reliable and valid. An adequacy criterion was said to be reliable if it always produces consistent test results: if a program passes one adequate test, it will pass all adequate tests. An adequacy criterion was said to be valid if it always produces meaningful results: if there is an error in the program under test, then there is an adequate test set that can reveal the error. They proved that such criteria can guarantee the correctness of adequately tested software. Unfortunately, it was later proved that there is no computable criterion that is both reliable and valid (Howden, 1976).

Since then, many adequacy criteria have been proposed and investigated. They can be classified into a number of categories. A program-based criterion decides if a test set is adequate according to whether the program has been thoroughly exercised. For example, statement coverage decides if a test set is adequate according to whether all the statements in the program have been executed during testing. Other program-based criteria include branch coverage, mutation adequacy (DeMillo et al., 1978; Howden, 1982; Woodward and Halewood, 1988), various data flow adequacy criteria (Laski and Korel, 1983; Ntafos, 1984; Rapps and Weyuker, 1985; Frankl and Weyuker, 1988), and many others. A specification-based criterion decides if a test set is adequate according to whether the software requirements or functional specification has been taken into full consideration in testing the software. Examples of specification-based adequacy criteria include mutation adequacy of algebraic specifications (Woodward, 1993). Combining these two approaches gives rise to combined specification- and program-based criteria, which use the ideas of both program-based and specification-based criteria. There are also test adequacy criteria that decide test adequacy without employing any internal information from the specification or the program. For example, test adequacy can be measured according to the prospective usage of the software by considering whether the test cases cover the data that are most likely to be frequently used as input in the operation of the software. Although few criteria have been explicitly proposed in such a way, selecting test cases according to the usage of the software is the idea underlying random testing (or statistical testing). In random testing, test cases are sampled at random according to a probability distribution over the input space. Such a distribution can be the one representing the operation of the software, in which case the random testing is called representative. It can also be any probability distribution, such as a uniform distribution, in which case the random testing is called non-representative. Generally speaking, if a criterion employs only the 'interface' information, it is called an interface-based criterion. Here, interface information is the information about the interface between the software and its user and environment, such as the type and range of software input/output and the probability distribution over the input space. Readers are referred to the work of Zhu et al. (1994) for a survey of test adequacy criteria.

It should be noticed that, although for all testing methods the correctness of program outputs must be checked against the specification or the requirements, the adequacy of a test set is independent of the correctness of software output on the test cases (Weyuker, 1986). For instance, a test set is considered adequate according to statement coverage if all statements in the program under test are executed during testing, no matter whether the output on each test case is correct. The role of test adequacy in software testing is that if a test set is adequate and the program executes on the test cases successfully, then the software should be close to being correct. Here, testing being successful means that the program produces correct outputs on the test cases. However, this role of test adequacy has not been fully justified formally. This is the subject of this paper.

To assess existing adequacy criteria, as well as to formalize the notion of adequacy, Weyuker (1986, 1988) proposed a set of axioms. The axiom system was refined, formalized and studied by Parrish and Zweben (1991, 1993). Since test adequacy is often measured by the percentage of code coverage, Zhu and Hall (1993) proposed another axiom system on the basis of the mathematical theory of measurement. In their axiom system, an adequacy criterion is considered as a function from a triple of a test set, a program and a specification to a real number that represents the degree of test adequacy of testing the program against the specification by using the test set. To date, axiomatic studies of adequacy criteria have focused on the formalization and clarification of the notion of adequacy and the investigation of relationships between the axioms and the inferable properties. A number of interesting results have been obtained, such as the applicable data flow adequacy criteria (Frankl and Weyuker, 1988) and the irregularity theorem (Zhu et al., 1995). However, the fundamental question, that is, the relationship between the criteria characterized by the axioms and the reliability and correctness of the tested software, is still an open problem.

To answer this question, one must inevitably ask what an ‘adequate test’ means. The axioms of software test adequacy try to characterize the notion of adequacy by its fundamental properties. However, the axioms do not directly answer what the meaning of adequacy is. This paper will apply the model theory of mathematical logic (Chang and Keisler, 1973) to provide interpretations of the axioms characterizing the notion of adequacy. According to model theory, a model of an axiom system is a mathematical structure that satisfies the axioms. It interprets the meanings of the formal axioms with the mathematical structure. Hence, semantics are assigned to the axiom system. Software testing can be viewed as an inductive inference process during which the tester attempts to deduce the properties of a software system on the whole input space by observing its behaviour on a finite number of test cases (Weyuker, 1983; Zhu et al., 1992). Therefore, the specific interpretations proposed in this paper are based on the theory of inductive inference so that the implicit induction underlying the software testing process is made explicit.

Recent years have seen a rapid growth of interest in the interpretation of traditional logic systems in the context of computer science. For example, in the study of type theory, a type was interpreted as a statement and the elements of the type were interpreted as proofs of the statement. In the context of computer science, a type can be interpreted as a formal specification and an element of a type can be interpreted as a program that satisfies the specification. Therefore, a formal software development method can be developed based on type theory; see, for example, the work of Barendregt and Nipkow (1994). This paper will show that interpreting axioms of adequacy criteria with inductive inference yields results relating adequate testing to software correctness and reliability.

Gold’s ‘identification-in-the-limit’ model of inductive inference (Gold, 1967) will be used to construct a model of Weyuker’s axioms of test adequacy criteria, and to interpret test adequacy as convergence of the induction process. Within this model, it can be proved that an adequately and successfully tested software system is correct if the program and the specification are all learnable by an inductive inference machine. The construction of the model also manifests that the inductive inference underlying software testing has the conservative property, which means that conclusions drawn from test results should not be changed unless new observations are found to be inconsistent with those conclusions. It also has the property of ‘simplest hypothesis’, which means that conclusions drawn from test results should only concern the tested part of the software. Zhu and Hall’s axioms of test adequacy measurements will also be interpreted by inductive inference, but using Valiant’s ‘PAC inductive inference’ model (Valiant, 1984). Test adequacy is then interpreted in terms of the reliability of the function that can be derived from the number of software input/output instances. This interpretation enables one to introduce the complexity of the software under test into software reliability estimation. Consequently, a new approach to software reliability estimation is proposed that differs from existing reliability estimation methods in taking software complexity into account. It also enables one to apply fault-based testing to random testing.

The paper is organized as follows. Section 2 constructs a model for Weyuker's axiom system and discusses the relationship between testing and software correctness. Section 3 constructs a model for Zhu and Hall's axiom system and studies the relationship between adequacy measurement and software reliability. Section 4 gives some concluding remarks: the results of the paper and their practical implications are summarized, and related work and further work are discussed.

2. INDUCTIVE INFERENCE MODEL OF WEYUKER’S AXIOM SYSTEM

2.1. The axiom system

Weyuker (1986) proposed a set of axioms to characterize the notion of test adequacy. Originally, Weyuker's axioms were given informally. The formal expression used in this paper is partly based on the work of Parrish and Zweben (1991). The axioms were for program-based test data adequacy criteria; software specifications were not considered. Therefore, a program-based adequacy criterion C is formally defined as a predicate on a given program space P and the class T of test sets. In this paper, the test sets are assumed to be the subsets of a countable universal data space D, i.e. $T = 2^D$, and the program space P is a subset of the computable total functions on D, i.e. $P \subseteq D \to D$.

The most fundamental axioms in Weyuker's system are finite applicability and monotonicity.

Axiom A1: Finite applicability

For all programs p, there exists a finite adequate set of test cases. Formally,

$\forall p \in P \,.\, \exists t \in T \,.\, (\|t\| < \infty \wedge C(t, p))$   (1)

Axiom A2: Monotonicity

If a test set t is adequate for testing program p, and t is a subset of t', then t' is also adequate for testing p. Formally,

$\forall p \in P \,.\, \forall t, t' \in T \,.\, (C(t, p) \wedge (t \subseteq t') \Rightarrow C(t', p))$   (2)

From these two axioms, it can be proved that the exhaustive test set D is always adequate, i.e. $\forall p \in P \,.\, C(D, p)$. However, the other extreme case, that the empty test set is always inadequate, cannot be deduced from the axioms. Hence, it is given as an axiom.

Axiom A3: Inadequacy of the empty test set

The empty test set is not adequate for testing any program. Formally,

$\forall p \in P \,.\, \neg C(\emptyset, p)$   (3)

In the axiom system, Weyuker (1986) asserts that neither the semantics nor the syntactic structure of a program alone can determine the adequacy of a test set. She also asserts that the adequacy of testing a program does not imply the adequacy of testing a component in the program, nor does the adequate testing of all the components imply the adequacy of testing the whole program. These axioms have been controversial (Zweben and Gourlay, 1989). Hence, they will not be considered in this paper. Aware of the inadequacy of the axiom system, Weyuker (1988) proposed three additional axioms: the complexity property, the statement coverage property and the renaming property.

Axiom A4: Complexity property

For every natural number n, there is a program p such that any test set of size less than n is not adequate to test p. Formally,

$\forall n \in N \,.\, \exists p \in P \,.\, \forall t \in T \,.\, (C(t, p) \Rightarrow \|t\| \geq n)$   (4)

Axiom A5: Statement coverage property

If a test set t is adequate for testing a program p, then t causes every executable statement of p to be executed. Formally,

$\forall t \in T, p \in P \,.\, (C(t, p) \Rightarrow SC(t, p))$   (5)

where SC(t, p) means every executable statement in p is executed during the execution of p on the test cases in t.

Let p be the program obtained by systematically renaming some variables in a program q. The ‘renaming property’ states that a test set t is adequate for testing p if and only if t is adequate for testing q. In this paper, programs that are equivalent up to renaming the variables will be considered as the same program. Therefore, the renaming property will not be considered as an axiom.

2.2. Construction of the model

This subsection will provide a model of the axiom system given above, where the term 'model' is in the sense of model theory in mathematical logic. According to the theory, a model is a mathematical structure that satisfies the axioms. The construction of a model means assigning specific elements of the mathematical structure to the symbols used in the axioms. These elements provide an interpretation of the symbols. Hence, meanings are assigned to the axioms. In this way, the semantics of a formal axiom system are formally defined. The model to be constructed is based on a theory of inductive inference.

Inductive inference is the deduction of a rule from a finite number of instances of the rule. It is also called 'learning from examples' in the literature. There are a number of inductive inference methods (see the work of Angluin and Smith, 1987, and Smith and Angluin, 1983, for surveys of the research in the area). A traditional model of inductive inference is Gold's 'identification in the limit' (Gold, 1967). It views inductive inference as an infinite process. An inductive inference device M is supposed to run repeatedly on larger and larger collections of instances of a given rule. Each time a new instance is input into the device M, it produces a hypothesis of the rule. Therefore, an infinite sequence of hypotheses will be generated, say $f_1, f_2, \ldots, f_n, \ldots$. If, after generating a finite number of hypotheses, M does not change the output any more, then the inference process is said to converge. Of course, an inductive inference process may converge to a wrong hypothesis. Whether a hypothesis is correct is determined by a correctness criterion.

This model of inductive inference can be applied to various kinds of rule, such as grammar rules and Boolean functions. Without loss of generality, in this paper rules will be computable total functions with a countable set D as domain and co-domain. An instance of a rule f is a pair of elements (x, y) of D such that $f(x) = y$. For a finite subset $X = \{x_1, x_2, \ldots, x_n\}$, a function $\phi: X \to D$ can be represented as a set of ordered pairs $\{(x_1, \phi(x_1)), (x_2, \phi(x_2)), \ldots, (x_n, \phi(x_n))\}$. It can be considered as an instance set of a rule $f: D \to D$.

Definition 1: Inductive inference device

An inductive inference device M is a function

$M: \bigcup \{X \to D \mid X \subseteq D \wedge \|X\| < \infty\} \to (D \to D)$

such that, for all functions $\phi: X \to D$,

$M(\phi)|_X = \phi$   (6)

where $X \to D$ denotes the set of functions from X to D, and $f|_X$ is the function f restricted to a subset X of its domain.

Notice that equation (6) imposes a condition on the inductive inference device M, which is sometimes called the 'consistency property'.

Definition 2: Identification in the limit

Let M be an inductive inference device, and $a = a_1, a_2, \ldots, a_n, \ldots$ be an infinite sequence of instances of a given rule f. Let $f_n = M(\{a_1, a_2, \ldots, a_n\})$. If there is a natural number K such that for all $n, m \geq K$, $f_m = f_n$, then it is said that M converges to $f_K$ on a and that M behaviourally identifies f correctly in the limit, or simply, f is behaviourally learnable by M. If $f_K = f$, it is said that M explanatorily identifies f correctly in the limit. A set P of rules is said to be behaviourally (or explanatorily) learnable by M if, for all $f \in P$, f is behaviourally (or explanatorily) learnable by M.

Notice that, if p is behaviourally learnable by M and the inductive inference process converges to f, then $p = f$. If a set of rules is explanatorily learnable, then it must be behaviourally learnable. The set of one-variable polynomials is an example of an explanatorily learnable set of functions. Readers are referred to the work of Case and Smith (1983) for more examples of learnable rule sets.
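To make Definition 2 concrete, the following sketch (a Python illustration added in this transcript, not part of the original paper) implements identification in the limit for the one-variable polynomials mentioned above: the device returns the lowest-degree polynomial consistent with the instances seen so far, so the hypothesis sequence stabilizes once the instances determine the hidden rule. Note that the convergence point itself is not effectively detectable in general; the printout merely lets one observe the stabilization.

    from fractions import Fraction

    def interpolate(points):
        # Coefficients (c0, c1, ...) of the lowest-degree polynomial through
        # the given (x, y) points, by exact Lagrange interpolation.
        coeffs = [Fraction(0)] * len(points)
        for i, (xi, yi) in enumerate(points):
            basis = [Fraction(1)]              # running product of (x - xj)
            denom = Fraction(1)
            for j, (xj, _) in enumerate(points):
                if j == i:
                    continue
                denom *= xi - xj
                new = [Fraction(0)] * (len(basis) + 1)
                for k, c in enumerate(basis):  # multiply the basis by (x - xj)
                    new[k + 1] += c
                    new[k] -= xj * c
                basis = new
            for k, c in enumerate(basis):
                coeffs[k] += yi * c / denom
        while len(coeffs) > 1 and coeffs[-1] == 0:
            coeffs.pop()               # strip trailing zeros so that equal
        return tuple(coeffs)           # polynomials compare equal

    def M(instances):
        # Inductive inference device (Definition 1): the hypothesis is the
        # simplest polynomial consistent with the instance set.
        return interpolate(sorted(instances.items()))

    f = lambda x: 3 * x * x + 1        # the hidden rule to be learned
    for n in range(1, 7):              # larger and larger instance sets
        t = {Fraction(x): Fraction(f(x)) for x in range(n)}
        print(n, M(t))                 # output stabilizes from n = 3 onwards

This device is also conservative: an interpolant consistent with a new instance is left unchanged, which anticipates the properties required of M in the construction below.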

The notions of software testing can now be interpreted in the terminology of inductive inference as follows. A program under test is interpreted as a rule to learn. A test case is interpreted as an instance of the rule. A test set is then a set of instances. The statement 'a test set is adequate to test a program' is interpreted as 'a rule can be learned from the set of instances'. With these interpretations, the axioms of adequacy criteria can be interpreted accordingly. For example, the inadequacy of the empty test set can be interpreted as 'no rule can be learned from the empty set of instances'. Table I gives the informal interpretations of the axioms in Weyuker's system according to the inductive inference model.


Table I. Interpretation of Weyuker's axioms of test adequacy

Finite applicability: All rules can be learned from a finite number of instances.

Monotonicity: If a rule can be learned from a set t of instances, then it can be learned from any instance set that contains t.

Inadequacy of the empty set: No rule can be learned from the empty set of instances.

Complexity property: For any natural number n, there is a rule that can be learned from a set of n instances, but not from any set of fewer than n instances.

Statement coverage property: If a rule p (a program) can be learned from a set t of instances, then the execution of the program on the input data in t will cover all the statements in p.

Renaming property: If a rule can be learned from a set of instances, then a rule obtained by renaming the variables in the rule can always be learned from the same set of instances.

Formally, given an inductive inference device M, the following function will be used to interpret adequacy criteria.

Definition 3: Inductive inference adequacy criterion induced from M

Let M be an inductive inference device. The function $C_M(t, p)$ defined by the following equation is called the inductive inference adequacy criterion induced from M:

$C_M(t, p) = \text{true} \iff M(p|_t) = p$   (7)

where $t \subseteq D$ is a test set and p is a function from D to D.

To complete the construction of the model, it must be proved that the axioms are satisfied by the interpretation. However, not all functions $C_M$ induced from an arbitrary M can be considered as interpretations of test adequacy criteria. Therefore, certain properties must be imposed on inductive inference devices. The properties considered here are the conservative property and the simplest hypothesis property. As will be seen later in the paper, they are the necessary and sufficient conditions for a model to be constructed on an inductive inference machine. They are formally defined as follows.

Definition 4: Conservative and simplest hypothesis properties

An inductive inference device M is said to be conservative if, for any sequence of input instances, M outputs a hypothesis different from its previous output only if the new instance is incompatible with the previous hypothesis. M is said to have the simplest hypothesis property if it always produces the simplest hypothesis consistent with the input instances.

Program synthesis systems, such as those of Zhu and Jin (1991) and Hutchinson (1994), often have these properties. Here, a program synthesis system is a software system that takes a finite set of input/output pairs as a specification and generates a program that is compatible with the input/output pairs.
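Definition 3 can also be phrased directly as code. The sketch below is an illustration added in this transcript: the device M is assumed to return its hypothesis as a callable, and the finite probe set is a stand-in for the generally undecidable test of function equality $M(p|_t) = p$.

    def restrict(p, t):
        # The instance set p|_t = {(x, p(x)) | x in t} of program p on test set t.
        return {x: p(x) for x in t}

    def induced_criterion(M, t, p, probe):
        # C_M(t, p) of Definition 3: t is adequate for p iff the device M
        # re-derives p from p's behaviour on t alone. Equality of functions is
        # undecidable in general, so it is approximated on a finite probe set.
        hypothesis = M(restrict(p, t))
        return all(hypothesis(x) == p(x) for x in probe)

If M is conservative and has the simplest hypothesis property (Definition 4), this induced criterion is exactly the kind of function that Theorem 1 below shows to satisfy Axioms A1 to A5.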

Lemma 1

Let M be an inductive inference device.

(1) If M is conservative, $C_M$ is monotonic. Formally,

$\forall p \in P \,.\, \forall t, t' \in T \,.\, (C_M(t, p) \wedge (t \subseteq t') \Rightarrow C_M(t', p))$   (8)

(2) If the program space P is explanatorily learnable by M, $C_M$ is finitely applicable. Formally,

$\forall p \in P \,.\, \exists t \in T \,.\, (\|t\| < \infty \wedge C_M(t, p))$   (9)

Proof

(1) Assume $C_M(t, p)$. By Definition 3, $M(p|_t) = p$. By the conservative property of M, it follows that $M(p|_{t'}) = p$ for all $t \subseteq t'$. That is, $C_M(t', p)$.

(2) Let $p \in P$. Because P is explanatorily learnable by M, by Definition 2, M run on the instances $(a_1, p(a_1)), (a_2, p(a_2)), \ldots$ converges, for some natural number K, to the function $f_K = M(p|_{\{a_1, a_2, \ldots, a_K\}})$. Since p is explanatorily learnable by M, it follows that $f_K = p$. That is, $M(p|_{\{a_1, a_2, \ldots, a_K\}}) = p$. Therefore, $C_M(\{a_1, a_2, \ldots, a_K\}, p)$. □

Notice that the conservative property is also a necessary condition for the monotonicity of $C_M$. The learnability of P is also a necessary condition for the finite applicability of $C_M$.

Lemma 2 (Weyuker, 1983)

Let M be an inductive inference device. If M has the simplest hypothesis property and $M(p|_t) = p$, then the execution of p on the test cases in t covers all the branches in p.

Readers are referred to the work of Weyuker (1983) for the proof of the lemma. Because the empty set does not cover any branch in a program, and branch coverage subsumes statement coverage (if a program has no branch statement, it is regarded as containing one branch), it is easy to prove the following properties by using Weyuker's lemma.

Lemma 3

Let M be an inductive inference device. If M has the simplest hypothesis property and P is explanatorily learnable by M, then $C_M$ has the following properties:

(1) inadequacy of the empty test set, i.e. $\forall p \in P \,.\, \neg C_M(\emptyset, p)$;
(2) statement coverage, i.e. $\forall p \in P \,.\, \forall t \in T \,.\, (C_M(t, p) \Rightarrow SC(t, p))$;
(3) the complexity property, i.e. $\forall n \in N \,.\, \exists p \in P \,.\, \forall t \in T \,.\, (C_M(t, p) \Rightarrow \|t\| \geq n)$.


Proof

The proof of (1) and (2) using Lemma 2 is straightforward. (3) Consider the program schema $p_n$ in Figure 1. Any test set that covers all the branches in $p_n$ must contain at least n test cases. By Weyuker's lemma, if $C_M(t, p_n)$, then t must cover all branches in $p_n$. Therefore, $\|t\| \geq n$. □

Theorem 1

Let M be an inductive inference device. The function $C_M(t, p)$ defined by equation (7) satisfies Axioms A1 to A5 if M has the following properties:

(1) M is conservative;
(2) M has the simplest hypothesis property;
(3) P is explanatorily learnable by M.

Proof

By Lemma 1 to Lemma 3. □

It is easy to see from the proofs of the lemmas that the three conditions are necessary. If M does not satisfy all three conditions, then $C_M$ will not satisfy all the axioms. Therefore, these conditions manifest the nature of the inductive inference process underlying software testing. The conservative property implies that the conclusion drawn from test results should not be changed unless new observations are found inconsistent with the conclusion. The simplest hypothesis property implies that the conclusion drawn from test results should only concern the tested part of the software.

Figure 1. A program schema that needs at least n test cases


2.3. Relationship between adequate testing and software correctness

The purpose of software testing is to validate the software against its requirements. Hence, a fundamental question in software testing research is how an adequate test is related to the correctness and reliability of software passing the test. Given that there is no computable test adequacy criterion that can guarantee, without any conditions, the correctness of tested software (Howden, 1976), this section searches for the conditions under which correctness can be guaranteed.

In an inductive inference process, an inductive inference device may generate a sequence of hypotheses. Such a hypothesis is a correct identification of the rule to be learned only if the inductive inference process has reached a convergence point. Similarly, for software testing, the correctness of the tested software can be guaranteed if the inductive inference process underlying the testing process converges.

Theorem 2

Let M be a conservative inductive inference device, p be explanatorily learnable by M, s be behaviourally learnable by M, and t be a test set. The program p is correct with respect to specification s if:

(1) the test set t is inductive inference adequate for testing p, i.e. $C_M(t, p) = \text{true}$;
(2) the program p is successfully tested on t with respect to s, i.e. $\forall x \in t \,.\, (p(x) = s(x))$;
(3) M converges to some function f on $(a_1, s(a_1)), (a_2, s(a_2)), \ldots, (a_n, s(a_n)), \ldots$ and $f = M(\{(a_1, s(a_1)), (a_2, s(a_2)), \ldots, (a_n, s(a_n))\})$, where $\{a_1, a_2, \ldots, a_n\} \subseteq t$.

Proof

By Definition 3, it follows from condition (1) that

$M(p|_t) = p$

Since s is behaviourally learnable by M, it follows from condition (3) that

$f = s$

Because M is conservative and $\{a_1, a_2, \ldots, a_n\} \subseteq t$, it follows that

$M(s|_t) = f = s$

By condition (2), $s|_t = p|_t$. Therefore, $s = M(s|_t) = M(p|_t) = p$. □


Example 1

Let P be the set of programs of the following form:

    begin
      input(x);
      if x = a_1 then y := 1
      elsif x = a_2 then y := 2
      ...
      elsif x = a_k then y := k
      else y := 0
      endif;
      output(y)
    end

where k > 0 and $a_1, a_2, \ldots, a_k$ are natural numbers. Define an inductive inference machine M as follows:

$M(\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}) = (x_{i_1} \to y_{i_1}, x_{i_2} \to y_{i_2}, \ldots, x_{i_k} \to y_{i_k})$

where each $(x_{i_j}, y_{i_j}) \in \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$; if $y_i > 0$ then $(x_i, y_i) \in \{(x_{i_j}, y_{i_j}) \mid j = 1, 2, \ldots, k\}$; and for all $j = 1, 2, \ldots, k-1$, $0 < y_{i_j} < y_{i_{j+1}}$. Here $(a_1 \to b_1, a_2 \to b_2, \ldots, a_k \to b_k)$ denotes the following program:

    begin
      input(x);
      if x = a_1 then y := b_1
      elsif x = a_2 then y := b_2
      ...
      elsif x = a_k then y := b_k
      else y := 0
      endif;
      output(y)
    end

Obviously, P is explanatorily learnable by identification in the limit. Let p be a program in P, s be a specification in P, and t be a test set. The following explains the meaning of the conditions given in Theorem 2.

(a) The adequacy of a test set t for testing p means that $M(p|_t) = p$. It is easy to see that a test set t is adequate for testing a program p in P according to $C_M$ if and only if it contains all the elements $a_1, a_2, \ldots, a_k$ of p. Therefore, condition (1) means that t contains all the points on which p(x) is not equal to 0.

(b) Condition (2) means that the program produces correct output with respect to s on all points in t.

(c) Condition (3) requires that M converges to some program f on $(x_1, s(x_1)), (x_2, s(x_2)), \ldots, (x_n, s(x_n)), \ldots$, where $f = M(\{(x_1, s(x_1)), (x_2, s(x_2)), \ldots, (x_n, s(x_n))\})$ and $\{x_1, x_2, \ldots, x_n\} \subseteq t$. This means that f also contains all the points x on which s(x) is not equal to 0.

By Theorem 2, it follows that p is correct with respect to s on all input data. In fact, the correctness can be derived directly from the fact that t contains all the points on which p or s is not equal to 0.
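The machine M of this example is simple enough to execute. In the following Python sketch (added in this transcript; the dictionary encoding of the programs is an illustration), a test set turns out to be adequate for p exactly when it contains every point with nonzero output:

    def infer(instances):
        # Example 1's machine M: keep the instance pairs with nonzero output,
        # ordered by output value; every other input is mapped to 0.
        table = {x: y for x, y in instances.items() if y > 0}
        return dict(sorted(table.items(), key=lambda item: item[1]))

    def run(table, x):
        # Execute the program (a_1 -> b_1, ..., a_k -> b_k) denoted by the table.
        return table.get(x, 0)

    p = {3: 1, 7: 2}                 # the program with a_1 = 3, a_2 = 7
    t = {0, 1, 3, 7, 12}             # a test set containing both 3 and 7
    print(infer({x: run(p, x) for x in t}) == p)   # True; dropping 7 from t
                                                   # would make it False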

Notice that, firstly, generally speaking, in the model of identification in the limit it may not be decidable whether an inductive inference process has converged on a set of instances. Therefore, Theorem 2 does not conflict with the non-existence of effective adequacy criteria that can guarantee software correctness.

Secondly, if a function set P is explanatorily learnable by M, then for every program p in P, M will converge on some finite set of instances of p. Theorem 2 implies that every function in P can be tested with a finite number of test cases to guarantee correctness. In other words, learnability implies testability. This coincides with the intuition that testing is easier than learning.

Finally, Theorem 2 requires that the program is actually generated by an inductive inference device. This is not what happens in software testing practice. Fortunately, the following theorem proves that this is unnecessary if both the program and the specification are behaviourally learnable by identification in the limit.

Theorem 3

Let $D = \{x_1, x_2, \ldots, x_n, \ldots\}$ and P be a set of functions defined on D. If P is behaviourally learnable by identification in the limit, then there is a function $K: P \to N$ such that for all $p, s \in P$, $p = s$ if and only if $p(x_i) = s(x_i)$ for all $i \leq \max(K(p), K(s))$. The smallest such integer-valued function $K: P \to N$ is called the learning complexity.

Proof

Let $r \in P$. Since P is behaviourally learnable, by Definition 2 there is an inductive inference device M such that M converges to r on $((x_1, r(x_1)), \ldots, (x_n, r(x_n)), \ldots)$. Moreover, there is a natural number $K_r$ such that $M(\{(x_1, r(x_1)), \ldots, (x_{K_r}, r(x_{K_r}))\}) = f_r = r$ and, for all $n > K_r$, $M(\{(x_1, r(x_1)), \ldots, (x_n, r(x_n))\}) = f_r$. Define K(r) to be the minimal $K_r$ that has this property.

Let $K = \max(K(p), K(s))$. The following proves that p = s if and only if $p(x_i) = s(x_i)$ for all $i \leq K$. Obviously p = s implies $p(x_i) = s(x_i)$ for all $i \leq K$. Therefore, only the other direction needs a proof.

Because $p(x_i) = s(x_i)$ for all $i \leq K$, it follows that

$\{(x_1, p(x_1)), \ldots, (x_K, p(x_K))\} = \{(x_1, s(x_1)), \ldots, (x_K, s(x_K))\}$   (11)

Let $q = M(\{(x_1, p(x_1)), \ldots, (x_K, p(x_K))\})$. By equation (11), it is also the case that

$q = M(\{(x_1, s(x_1)), \ldots, (x_K, s(x_K))\})$   (12)

Since $K \geq K(p)$, by the definition of the function K, it follows that $p = q$. Similarly, $q = s$ follows from $K \geq K(s)$ and equation (12). Therefore, $p = s$. □

Informally, Theorem 3 states that if both program p and specification s belong to a set of rules that are learnable by identification in the limit, the correctness of the program p with respect to s can be tested by a finite number of test cases. Generally speaking, the number of test cases may not be computable from the program and the specification, because the learning complexity function K may not be computable. Notice that the learning complexity depends on the inductive inference device and the ordering of the elements in D.

Example 2

Consider the set P of programs defined in Example 1. Let $D = \{0, 1, 2, 3, \ldots\}$. It is obvious that, for any p in P, $K(p) = \max\{a_1, a_2, \ldots, a_k\}$ is the learning complexity function defined in Theorem 3.
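Keeping the dictionary encoding of Example 1's programs used in the earlier sketch (a transcript illustration; the function names are hypothetical), the learning complexity yields a simple, executable equality test of the kind Theorem 3 promises:

    def learning_complexity(table):
        # K(p) = max{a_1, ..., a_k} for the programs of Example 1 (Example 2);
        # the always-zero program has complexity 0.
        return max(table, default=0)

    def equal_by_testing(p, q):
        # Theorem 3 for this class: p = q iff they agree on the inputs
        # x = 0, 1, ..., K, where K = max(K(p), K(q)).
        K = max(learning_complexity(p), learning_complexity(q))
        return all(p.get(x, 0) == q.get(x, 0) for x in range(K + 1))

    print(equal_by_testing({3: 1, 7: 2}, {3: 1, 7: 2}))   # True
    print(equal_by_testing({3: 1, 7: 2}, {3: 1}))         # False: differ at 7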


In recent years, Bougé et al. (1986) and Bernot et al. (1991) have introduced software testing hypotheses as conditions for the correctness of adequately tested programs. One of their testing hypotheses, the 'regularity hypothesis', states that a program is correct on all inputs if the program is correct on all test cases of complexity less than a given degree. This hypothesis certainly captures the intuition of the induction underlying software testing. As an example of the application of Theorem 3, the following corollary validates the regularity hypothesis. Let $|\cdot|$ be a function from D to the natural numbers; $|x|$ gives the complexity of the element x in D.

Corollary 1

Let P be a behaviourally learnable set of functions. Let $|\cdot|$ be a complexity measure of the elements in D such that, for any natural number N, the subset $\{x \in D \mid |x| \leq N\}$ is finite. Then, for all functions $p, s \in P$, there exists a natural number N such that p = s if and only if p(x) = s(x) for all $x \in D$ with $|x| \leq N$.

Proof

First, enumerate the elements in D as $x_1, x_2, \ldots, x_n, \ldots$ in such a way that $|x_i| \leq |x_j|$ if $i < j$. Since P is learnable, by Theorem 3 there is a learning complexity function K. Let $K = \max(K(p), K(s))$. Now, define $N = \max_{i \leq K}(|x_i|)$. The corollary follows from Theorem 3. □

A practical implication of the theorem is that the correctness of a software system can be validated by testing without writing down a formal functional specification. This lays a foundation for the current practice of software testing without a formal specification. However, the theorem requires that the functional specification belongs to a particular set of functions which is inductively learnable and includes the function implemented by the program. It seems that what current testing practice lacks is an analysis of the ‘complexity’ of the program so that a range can be determined within which the program and the specification vary, such as was done in Example 1 and Example 2.

3. INDUCTIVE INFERENCE MODEL OF ZHU AND HALL'S AXIOM SYSTEM

As discussed in Section 2, it is undecidable when an inductive inference process converges in the model of identification in the limit. This problem can be solved by using Valiant's 'probably-approximately-correct' inference model (Valiant, 1984). In Valiant's model, the convergence of an inference process is measured by two parameters, one representing the approximation of the correctness of the inference result, the other representing the confidence in the result. This model will be used to interpret Zhu and Hall's axioms of adequacy measurements (Zhu and Hall, 1993).

3.1. The axiom system

Zhu and Hall's axiom system is based on the mathematical theory of measurement. Test adequacy criteria are considered as measurements rather than predicates. Consequently, they are defined as functions from a triple of a test set, a program and a specification to a real number in the unit interval. Let C be such an adequacy measurement. The adequacy of testing a program p against a specification s on a test set t is written as $C_p^s(t)$. The more adequate the test, the greater the value of $C_p^s(t)$. The axiom system consists of the following axioms.

Axiom B1: Inadequacy of an empty test

The empty set is inadequate as a test set for all software. Formally,

$\forall s \in S, p \in P \,.\, (C_p^s(\emptyset) = 0)$

Axiom B2: Adequacy of exhaustive testing

The exhaustive test set is adequate for all software. Formally,

$\forall s \in S, p \in P \,.\, (C_p^s(D) = 1)$

Axiom B3: Monotonicity

The more test cases are used, the more adequate the test. Formally,

$\forall s \in S, p \in P \,.\, \forall t, t' \in T \,.\, (t \subseteq t' \Rightarrow C_p^s(t) \leq C_p^s(t'))$

Axiom B4: Law of diminishing returns

The more a program has been tested, the less a test set can further contribute to test adequacy in the context. Formally,

$\forall s \in S, p \in P \,.\, \forall t, c, c' \in T \,.\, (c \subseteq c' \Rightarrow C_p^s(t|c') \leq C_p^s(t|c))$

where the adequacy of a test set t in the context c, written $C_p^s(t|c)$, is defined by the following equation:

$C_p^s(t|c) = C_p^s(t \cup c) - C_p^s(c)$

Axiom B5: Convergence

For all increasing sequences of test sets $t_1 \subseteq t_2 \subseteq \ldots \subseteq t_n \subseteq \ldots$,

$C_p^s\left(\bigcup_{n=1}^{\infty} t_n\right) = \lim_{n \to \infty} C_p^s(t_n)$

Finite applicability, which is given by the following definition, is not an axiom in this system, but can be derived from the axioms (Zhu and Hall, 1993).


Definition 5: Finite applicability

An adequacy criterion is finitely applicable if, for all degrees of adequacy less than 1, there always exists a finite set of test cases that achieves the required adequacy degree. Formally,

$\forall s \in S, p \in P \,.\, \forall r \in [0, 1) \,.\, \exists t \in T \,.\, (\|t\| < \infty \wedge C_p^s(t) \geq r)$

In addition to the above axioms, which are to be satisfied by all test adequacy measurements, Zhu et al. (1993) also proposed some properties to characterize different approaches to adequacy measurement. One such property is context sensitivity, which has been shown to be the property that distinguishes structural software testing from random testing.

Definition 6: Context sensitivity

A test data adequacy measurement C is said to be context free if, for all programs p, specifications s and test sets t, u and v,

$\forall s \in S, p \in P \,.\, \forall t, u, v \in T \,.\, ((t \cap u = \emptyset \wedge t \cap v = \emptyset) \Rightarrow (C_p^s(t|u) = C_p^s(t|v)))$

Otherwise, it is called context sensitive. C is said to be weakly context free if, for all programs p, specifications s and test sets t, u and v,

$\forall s \in S, p \in P \,.\, \forall t, u, v \in T \,.\, ((t \cap u = \emptyset \wedge t \cap v = \emptyset \wedge \|u\| = \|v\|) \Rightarrow (C_p^s(t|u) = C_p^s(t|v)))$

Otherwise, it is called strongly context sensitive.

Zhu et al. (1993) argued that adequacy measurements for structural testing methods were context sensitive, while adequacy measurements for random testing were context free or weakly context free.

Notice that the axioms are all universally quantified for all programs and specifications. For the sake of convenience, the subscript p and the superscript s in $C_p^s(t)$ will subsequently be omitted.

3.2. Construction of the model

The model of Zhu and Hall's axiom system will be based on Valiant's probably-approximately-correct (PAC) inductive inference protocol (Valiant, 1984). A PAC inductive inference machine takes two parameters as input: a tolerable error rate ε and the required confidence δ in the inference result, where both ε and δ are positive real numbers less than 1. The machine then works out the required number n of instances, and asks the environment to generate n instances of the rule at random according to a given probability distribution over the input space. This probability distribution is arbitrary and unknown to the inductive inference machine. Finally, the machine produces a hypothesis of the rule.

The inference is said to be successful in learning a rule r if it produces a hypothesis p such that, with likelihood at least δ, $\Pr_{error}(p)$ is at most ε, where $\Pr_{error}(p)$ is the probability of p being incompatible with the next instance generated at random. Formally, a PAC inductive inference machine can be defined as follows.

Definition 7: PAC inductive inference machine

A probably-approximately-correct (PAC) inductive inference machine M = (M, $\phi_M$) consists of an inductive inference device M as defined in Definition 1 and a monotonic function

$\phi_M: (0, 1) \times (0, 1) \to \{0, 1, \ldots, n, \ldots\}$

Let P be a set of functions $f: D \to D$. P is said to be PAC inferable by M if, for any $\epsilon, \delta \in (0, 1)$ and for all $r \in P$, when t is a set of instances of r such that $\|t\| \geq \phi_M(\epsilon, \delta)$ and t is sampled at random according to a given distribution Pr on D, the likelihood that the following inequality holds is greater than or equal to δ:

$\Pr(\{x \in D \mid f(x) \neq r(x)\}) \leq \epsilon$

where $f = M(t)$. The smallest such integer-valued function $\phi_M(\epsilon, \delta)$ is called the sample complexity of M.

The parameters ε and δ represent the inference quality. Of the two quality parameters, the error rate is the more interesting, because it is directly related to the reliability of the inference result. The function $\phi_M$ of a PAC machine characterizes the inductive inference power of the machine. For the same parameters (ε, δ), the more powerful an inductive inference machine, the fewer input instances are required. On the other hand, given an inductive inference machine, the size of the set of instances generated at random will determine the 'quality' of the inference result. These quality parameters can be considered as a measurement of the convergence of the inference process. The theory of inductive inference has shown that the function $\phi_M(\epsilon, \delta)$ also depends on the set P of rules to learn.

Haussler (1992) developed a theory about the learnability of decision rules as a generalization of the PAC model. The following lemma is a special case of Haussler’s Theorem 8.

Lemma 4 (Haussler, 1992)

A permissible¹ set P of functions from a set Z to the real numbers R is PAC learnable if the pseudo dimension of P is finite. In such cases, the sample complexity function $\phi_M(\epsilon, \delta)$ is of the order

$\frac{1}{\epsilon}\left(\dim_P(P)\,\ln(1/\epsilon) + \ln(1/\delta)\right)$   (21)

where $\dim_P(P)$ is the pseudo dimension of the set P of functions.

¹ Permissibility is a notion of measurability which will not be of concern here in practical use of the theorem.


Proof

The lemma is a special case of Haussler's Theorem 8 with the 'PAC settings' α = 1/2 and ν = ε. In particular, equation (21) is derived from Haussler's equation (6) with his parameter M = 1. This is justified by Haussler (1992, p. 117) on the grounds that the standard discrete loss function is monotonic if it takes the value 0 when the function generated by an inductive inference machine produces correct output and the value 1 when it produces incorrect output. □

Informally, the pseudo dimension of a set of functions measures the richness of the set. The following is a formal definition of the notion.

Definition 8: Pseudo dimension

For a set P of functions from a set Z into the set of real numbers R, the pseudo dimension of P, written $\dim_P(P)$, is the largest k such that there exist two sequences of length k, $X = (x_1, x_2, \ldots, x_k) \in Z^k$ and $U = (u_1, u_2, \ldots, u_k) \in R^k$, such that for any sub-sequence $Y = (x_{i_1}, x_{i_2}, \ldots, x_{i_j})$ of X there exists $f \in P$ such that $f(x_i) + u_i > 0$ for all $x_i \in Y$ and $f(x_i) + u_i \leq 0$ for all $x_i \notin Y$. If no such largest k exists, then $\dim_P(P)$ is infinite.

Since this definition requires the functions to map into real numbers, it is assumed in the remainder of the paper that the data space D is the set of rational numbers. Notice that, from Haussler's result (Lemma 4), it is easy to see that $\phi(\epsilon, \delta)$ functions have the following common features. For any fixed value of δ:

(A) as ε decreases, $\phi(\epsilon, \delta)$ increases monotonically;
(B) as n approaches ∞, the ε satisfying the equation $\phi(\epsilon, \delta) = n$ approaches 0.

When the sets of instances of a rule are interpreted as test sets, as in the previous section, the 'quality' of the inductive inference result can be used to interpret the 'quality' of testing, i.e. the adequacy of a test set. The axioms of test adequacy can then be interpreted accordingly. For example, the monotonicity axiom can be interpreted such that the more instances are used in the inductive inference, the higher the quality of the inference result. The interpretations of the other axioms and properties are given in Table II.

The following constructs a formal model of the axiom system. Let δ be a given positive real number less than 1 and M = (M, $\phi_M$) be any given PAC inductive inference machine. The following function will be used to interpret test adequacy measurement.

Definition 9: Inductive inference adequacy measurement induced from M

Let δ be a fixed real value in the interval (0, 1). The following function $K_{M,\delta}$ is called the inductive inference adequacy measurement induced from the PAC inductive inference machine M:

$K_{M,\delta}(t) = 1 - \sup\{\epsilon \in (0, 1) \mid \phi_M(\epsilon, \delta) \geq \|t\|\}$

By definition, $\sup \emptyset = 0$.
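As an added illustration (not in the original paper), the induced measurement can be computed by numerically inverting an assumed sample-complexity function. The function phi below merely follows the shape of equation (21), with illustrative constants:

    import math

    def phi(eps, delta, dim):
        # Assumed sample-complexity function in the shape of equation (21):
        # (1/eps) * (dim * ln(1/eps) + ln(1/delta)); constants are illustrative.
        return math.ceil((dim * math.log(1 / eps) + math.log(1 / delta)) / eps)

    def adequacy(n, delta, dim):
        # K_{M,delta} for a test set of size n (Definition 9):
        # 1 - sup{eps in (0,1) | phi(eps, delta) >= n}.
        # phi is decreasing in eps, so the sup is found by bisection.
        if n == 0:
            return 0.0                 # sup is 1: phi(eps, delta) >= 0 always
        lo, hi = 1e-12, 1.0 - 1e-12
        if phi(hi, delta, dim) >= n:   # even eps near 1 demands n instances
            return 0.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if phi(mid, delta, dim) >= n:
                lo = mid               # mid is still inside the sup set
            else:
                hi = mid
        return 1.0 - lo

    print(adequacy(100, 0.9, 4))       # adequacy of 100 random test cases

Larger test sets drive the recoverable error rate ε towards 0 and the measurement towards 1, in line with Axioms B2, B3 and B5.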


Table II. Interpretation of Zhu and Hall's axioms of test adequacy measurements

Inadequacy of an empty test: The empty set of instances cannot achieve any quality of inductive inference.

Adequacy of exhaustive testing: The set of all instances of a rule guarantees the highest quality of the inductive inference result.

Monotonicity: The more instances used, the higher the quality of the inference.

Law of diminishing returns: The more instances that have been used, the less an additional set of instances can further improve the quality of inference.

Convergence: When a sequence of instances is used in inductive inference, the quality of the final result equals the limit of the quality of the intermediate results.

Finite applicability: For any given quality requirement, there is always a finite set of instances from which a result can be obtained to satisfy the quality requirement.

Context free property: The improvement of the inference quality gained by using an additional set of instances is independent of the instances used in the previous inductive inference process.

Weakly context free property: The improvement of the inference quality gained by using an additional set of instances depends on the size of the instance set used in the previous inference process, but not on the contents of the set.

The following theorem proves that the function defined above satisfies Axioms B1 to B5. Let M = (M, $\phi_M$) be a PAC inductive inference machine, where $\phi_M$ has the properties (A) and (B).

Theorem 4

The function $K_{M,\delta}$ defined in Definition 9 satisfies Axioms B1 to B5. Formally:

(1) $K_{M,\delta}$ satisfies the property of adequacy of exhaustive testing, i.e. $K_{M,\delta}(D) = 1$;
(2) $K_{M,\delta}$ is monotonic with respect to test set inclusion, i.e. $u \subseteq v \Rightarrow K_{M,\delta}(u) \leq K_{M,\delta}(v)$;
(3) $K_{M,\delta}$ satisfies the law of diminishing returns, i.e. $c \subseteq d \Rightarrow K_{M,\delta}(t|d) \leq K_{M,\delta}(t|c)$;
(4) $K_{M,\delta}$ satisfies the property of inadequacy of an empty test, i.e. $K_{M,\delta}(\emptyset) = 0$;
(5) $K_{M,\delta}$ satisfies the convergence property, i.e. for all increasing sequences of test sets $t_1 \subseteq t_2 \subseteq \ldots \subseteq t_n \subseteq \ldots$,

$K_{M,\delta}\left(\bigcup_{n=1}^{\infty} t_n\right) = \lim_{n \to \infty} K_{M,\delta}(t_n)$

Proof

(1) Let t be the exhaustive test set, i.e. t = D. The set t is infinite, since D is assumed to be a countably infinite set. Since $\phi(\epsilon, \delta) < \infty$ for any given ε and δ, it follows that $\{\epsilon \mid \phi(\epsilon, \delta) \geq \|t\|\} = \emptyset$, whose sup is 0. By Definition 9, the adequacy of t is 1.

(2) This follows from the property that $\phi_M$ monotonically decreases as ε increases.

(3) By Definition 9,

$K_{M,\delta}(t|d) = K_{M,\delta}(t \cup d) - K_{M,\delta}(d) = \sup\{\epsilon \mid \phi_M(\epsilon, \delta) \geq \|d\|\} - \sup\{\epsilon \mid \phi_M(\epsilon, \delta) \geq \|d\| + \Delta_{t,d}\}$

where $\Delta_{t,d} = \|t \cup d\| - \|d\| \geq 0$. For any given test set t, $\Delta_{t,d}$ decreases or stays constant as d increases. When $t \cap d = \emptyset$, $\Delta_{t,d} = \|t\|$ reaches its maximum. Since $\phi_M(\epsilon, \delta)$ is monotonic, the argument is to fix $\Delta_{t,d} = \|t\|$ and prove that $K_{M,\delta}(t|d)$ decreases as $\|d\|$ increases. Let $\epsilon_d = \sup\{\epsilon \mid \phi_M(\epsilon, \delta) \geq \|d\|\}$, $\epsilon_{d,t} = \sup\{\epsilon \mid \phi_M(\epsilon, \delta) \geq \|d\| + \|t\|\}$, and $\Delta\epsilon = \epsilon_d - \epsilon_{d,t}$. What needs to be proved is that, for any given t, $\Delta\epsilon$ decreases as $\|d\|$ increases. By Haussler's result (Lemma 4), $\phi(\epsilon, \delta)$ is of the order

$\frac{1}{\epsilon}\left(\dim_P(P)\,\ln(1/\epsilon) + \ln(1/\delta)\right)$

It is easy to see that, as ε decreases, $\phi(\epsilon, \delta)$ grows more and more rapidly, because the magnitude of its derivative with respect to ε,

$\frac{1}{\epsilon^2}\left(\dim_P(P)\,(\ln(1/\epsilon) + 1) + \ln(1/\delta)\right)$

increases as ε decreases. Therefore, for the same $\Delta_{t,d}$, $\Delta\epsilon$ decreases as $\|d\|$ increases.

(4) By Haussler's result (Lemma 4), it follows that for any given $0 < \delta < 1$, $\phi(\epsilon, \delta) \geq 0 = \|\emptyset\|$ for all $0 < \epsilon < 1$. Hence, $\sup\{\epsilon \mid \phi_M(\epsilon, \delta) \geq \|\emptyset\|\} = 1$. By Definition 9,

$K_{M,\delta}(\emptyset) = 1 - \sup\{\epsilon \mid \phi_M(\epsilon, \delta) \geq \|\emptyset\|\} = 1 - 1 = 0$

(5) Let $n_i = \|t_i\|$, $i = 1, 2, \ldots$. Since $t_1 \subseteq t_2 \subseteq \ldots \subseteq t_n \subseteq \ldots$, it follows that $n_1 \leq n_2 \leq \ldots \leq n_k \leq \ldots$. Because $\phi(\epsilon, \delta)$ monotonically increases as ε decreases, the sequence $\epsilon_i = \sup\{\epsilon \mid \phi(\epsilon, \delta) \geq n_i\}$, $i = 1, 2, \ldots$, decreases monotonically. Therefore, $\lim_{i \to \infty} \epsilon_i$ exists.

If $n_k \to K$, where K is a natural number, then there is a natural number N such that for all $i > N$, $n_i = K$. Let $\epsilon^* = \sup\{\epsilon \mid \phi(\epsilon, \delta) \geq K = n_{N+1}\}$. Hence, $\lim_{i \to \infty} K_{M,\delta}(t_i) = 1 - \epsilon^*$. On the other side of the equation, it is the case that for all $n > N$, $t_n = t = \bigcup_{i=1}^{\infty} t_i$ and $\|t\| = K$. Hence,

$K_{M,\delta}\left(\bigcup_{i=1}^{\infty} t_i\right) = 1 - \epsilon^*$

Therefore, the statement is true. If $n_k \to \infty$, by property (B) of the φ function, the sup terms on both sides of the equation tend to zero, so both sides equal 1. Therefore, the theorem is true. □


This theorem proves that the construction given in Definition 9 is indeed a model of the axiom system.

The induced adequacy measurement satisfies not only all the axioms but also the weakly context free property. Since the weakly context free property characterizes adequacy measurements for random testing (Zhu et al., 1993), the induced adequacy measurement is more suitable for interpreting adequacy measurements for random testing.

Theorem 5

Let M = (M, $\phi_M$) be a PAC inductive inference machine. The function $K_{M,\delta}$ defined in Definition 9 satisfies the weakly context free property.

Proof

By using the definitions, the proof is straightforward. □

3.3. Relationship between adequate testing and software reliability

There are a number of software reliability measures. The particular reliability measure used in this paper is the probability of failure on demand (p.f.d.), which is defined to be the probability that the program fails to output results consistent with its specification. Because in the PAC model two parameters are used to measure the quality of inductive inference results, in addition to the probability of failure on demand, another parameter will also be used to qualify software reliability.

Definition 10: Probable reliability

The δ-probable reliability of a software system is r if, with likelihood at least δ, the software's p.f.d. is at most r.

Probable reliability can be considered as an extension of Hamlet's probable correctness. A software system is said to be δ-probably correct if the probability that the software is correct is at least δ (Hamlet, 1987).

Let M = (M, $\phi_M$) be a PAC inductive inference machine and $K_{M,\delta}$ be the adequacy measurement induced from M with fixed δ.

Theorem 6

Let p be a program under test, $s \in P$ be the specification of p, and t be a test set. Then the δ-probable reliability of p with respect to s is $1 - K_{M,\delta}(t)$ if:

(1) P is PAC learnable by M;
(2) $M(s|_t) = p$; and
(3) t is sampled at random over D according to a given probability distribution.

In other words, with likelihood at least δ, it is the case that

$\Pr(\{x \in D \mid p(x) \neq s(x)\}) \leq 1 - K_{M,\delta}(t)$


where the probability is evaluated according to the same probability distribution used in sampling t.

Proof

By Definition 7, the proof is straightforward. □

Informally, the theorem states that, with likelihood at least δ, the probability of failure on demand of a tested software system is at most one minus the test adequacy measurement. There are four conditions on this statement:

(a) the specification must be learnable by the inductive inference machine;
(b) p must be the result of the inductive inference;
(c) the program must be correct on all test cases; this follows from the consistency property of the inductive inference device M and condition (2), that $M(s|_t) = p$; and
(d) the test set is sampled at random according to the same probability distribution by which the failure rate is calculated.

The following theorem proves that if both the specification and the program are PAC learnable, condition (b) is unnecessary.

Theorem 7

Let s, p and t be as in Theorem 6. The δ-probable reliability of p with respect to s is $2(1 - K_{M,\delta}(t))$ if:

(1) $p, s \in P$ and P is PAC learnable;
(2) p is correct on all test cases in t with respect to s;
(3) t is sampled at random according to the same probability distribution as the software's operational distribution.

Proof

Since P is PAC learnable, there is a PAC inductive inference machine M such that P is PAC learnable by M. Let $p' = M(s|_t)$. By Theorem 6, the δ-probable reliability of p' with respect to s is $1 - K_{M,\delta}(t)$.

Since p is correct on all test cases in t, it follows that $s|_t = p|_t$. By Theorem 6, the δ-probable reliability of p' with respect to p is also $1 - K_{M,\delta}(t)$. Since

$\Pr(\{x \mid p(x) \neq s(x)\}) \leq \Pr(\{x \mid p(x) \neq p'(x)\}) + \Pr(\{x \mid p'(x) \neq s(x)\})$

it follows that, with likelihood at least δ, the probability of failure on demand of p with respect to s is at most $2(1 - K_{M,\delta}(t))$. □


It is worth noting that Theorem 7 does not require that the program is actually generated by the inductive inference machine. It only requires that both the program and the specification are learnable and that the program passes a certain number of random tests. Theorem 7 does not even require the actual construction of a PAC inductive inference machine. According to Lemma 4, a rule set P is PAC learnable if and only if the pseudo dimension of P is finite. When the pseudo dimension of P is known, a conservative reliability estimation can be made according to the following lemma, which comes from the work of Haussler (1992) and gives an upper bound on the sample complexity of all PAC learnable classes of functions.

Lemma 5 (Haussler, 1992)

Let P be a permissible set of functions of finite pseudo dimension. When a function $p \in P$ is correct on m random samples with respect to a function $s \in P$, if

$m \geq \frac{1}{\epsilon}\left(\dim_P(P)\,\ln(1/\epsilon) + \ln(1/\delta)\right)$   (24)

then with probability at least δ the failure rate of p with respect to s is less than ε.

Proof

Theorem 8 with his settings a = 1/2, v = & and M = 1. In a similar way to the proof of Lemma4, this lemma is a special case of Haussler’s

0 Notice that the number of samples required by the inequality (24) is conservative, as

Haussler claimed that no serious attempt is made to tighten the constants in the inequality. Therefore, if an inductive inference machine can be constructed based on the knowledge of the set of functions, fewer random test cases will be needed by a smaller $(&, 6) function.

An important practical implication of Theorem 7 is that knowledge about the program under test and its specification is helpful for software reliability estimation. If it is known that the program and its specification belong to a set P of functions, and if the pseudo dimension of the set P is finite, then the reliability can be estimated according to the number of random test cases. The more such knowledge about the software one has, the smaller the set P can be assumed to be, and the smaller the pseudo dimension of P will be. Consequently, a better reliability estimate can be achieved by a random test set of the same size. Alternatively, a smaller random test set will be sufficient to estimate software reliability with the same accuracy. This approach to software reliability estimation is fundamentally different from statistical estimations, which essentially treat the software as a black box and ignore all knowledge about it. The practical application of Theorem 7 to the estimation of a particular software system consists of three steps, described below and then sketched in code for illustration.

The first step is the study of the program and the specification to determine a set P of functions such that both the program and the specification belong to the set.

The second step is to determine the pseudo dimension of P. If the dimension is finite, then the random testing can start immediately with the random generation of test cases. The number of test cases can be obtained from (24) according to the quality requirements (ε, δ) of the software and P's pseudo dimension. Otherwise, further study of the program and the specification should be conducted to get a smaller subset of P.

An alternative to the second step is to construct a PAC inductive inference machine for learning P and to determine the φ(ε, δ) function of the PAC machine directly. This may be difficult, but can produce a better estimation.

The third step is to test the software on the random test cases. If the software passes all the tests, an assertion on the reliability of the software can be made: with likelihood at least δ, the software's probability of failure on demand is at most ε. Otherwise, if the software fails on some test cases and is modified to fix the bugs, the whole reliability estimation process must restart from step one.
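As an illustration only, the three steps can be sketched in Python. The function sample_size below merely stands in for inequality (24): since the constants of (24) come from Haussler's Theorem 8 and are deliberately conservative, the generic PAC-style bound used in its body is an assumption for illustration, not the paper's inequality. The names program, oracle and random_input are likewise hypothetical stand-ins for the program under test, its specification, and sampling from the operational distribution.

import math

def sample_size(eps, delta, pseudo_dim):
    # Placeholder for inequality (24): a generic PAC-style bound of the
    # form (c1/eps)(d ln(c2/eps) + ln(1/(1 - delta))).  The constants c1
    # and c2 are assumptions for illustration; the actual constants come
    # from Haussler's Theorem 8 with alpha = 1/2, nu = eps and M = 1.
    c1, c2 = 64, 16
    return math.ceil((c1 / eps) * (pseudo_dim * math.log(c2 / eps)
                                   + math.log(1.0 / (1.0 - delta))))

def estimate_reliability(program, oracle, random_input, pseudo_dim, eps, delta):
    # Step 1 (choose the set P) and step 2 (determine its pseudo
    # dimension) are done off-line; pseudo_dim carries their result.
    m = sample_size(eps, delta, pseudo_dim)
    # Step 3: run the program on m test cases sampled at random from
    # the operational distribution.
    for _ in range(m):
        x = random_input()
        if program(x) != oracle(x):
            # A failure: after the bug is fixed, the whole estimation
            # must restart from step one, since the modified program
            # may no longer belong to the chosen set P.
            return False
    # All m tests passed: with likelihood at least delta, the
    # probability of failure on demand is at most eps.
    return True

The sketch returns True exactly when the assertion of the third step can be made.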

The following properties of pseudo dimension are useful for determining the pseudo dimension of a set of functions. Readers are referred to the work of Haussler (1992) for their proofs.

Lemma 6 (Dudley, 1978)
If a set P of real-valued functions is a function vector space, then the pseudo dimension of P is the dimension of the vector space.

Example 3
The set of multinomials in n variables in which each variable has degree less than or equal to k is a function vector space. The dimension of the vector space is (k + 1)^n. Therefore, its pseudo dimension is (k + 1)^n.

Suppose it is known that (a) the program has N inputs, (b) no execution of the program exceeds K assignments, and (c) every assignment computes a multinomial function of the program's variables; then the function computed by the program is an N-variable multinomial. In some cases (though not the worst case) the degree of the multinomial can reach 2^{K/2} if, say, the pair of assignments y := x; x := x × y is repeated K/2 times. If there is reason to believe that the function required by the specification is of the same degree as the function computed by the program, then one can estimate the reliability of the software according to (24) by substituting (2^{K/2} + 1)^N for dim_P(P). However, the degree of the function actually computed by the program could be much smaller than that. Therefore, a much smaller pseudo dimension may be obtained if one has more knowledge about the program.
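The worst-case degree growth described above can be checked mechanically. A small sketch, assuming the sympy symbolic algebra library (the variable names are illustrative): it repeats the pair of assignments y := x; x := x × y for K/2 iterations, i.e. K assignments in total, and reports the degree of the polynomial finally held in x.

import sympy

t = sympy.symbols('t')   # the program's single input
x, y = t, t              # program state on entry; y's initial value is irrelevant

K = 10                   # total number of assignments executed (even)
for _ in range(K // 2):  # each iteration performs two assignments
    y = x                # y := x
    x = x * y            # x := x * y

# x now holds a polynomial in t of degree 2**(K // 2)
print(sympy.degree(x, t))  # prints 32 for K = 10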

The following property of pseudo dimension enables the application of the idea of perturbation testing (Zeil, 1983) to random testing and software reliability estimation.

Lemma 7 (Wenocur and Dudley, 1981)
Let p be a fixed function from D to the real numbers, and G be a set of functions from D to the real numbers. The pseudo dimension of {p + g | g ∈ G} is equal to the pseudo dimension of G.

Now, suppose that p is the function computed by a program. As in perturbation testing, it is assumed that the program is close to being correct in the sense that the error function e between the program p and the required function s belongs to an error space G. That is, s = p + e, where e ∈ G. Therefore, the reliability of the software can be estimated by using the pseudo dimension of G. This application of Theorem 7 and Lemma 5 is of particular importance because the number of assignments in a program that contains loops may not be bounded. Hence, the function computed by such a program cannot be treated as a multinomial, as was done in Example 3. However, the reliability can still be estimated by taking its error space to be a function vector space, such as a space of multinomials.
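A minimal sketch of this use of Lemmas 6 and 7, assuming that the error space G is taken to be the vector space of multinomials in the program's inputs with per-variable degree at most k (the function and parameter names are illustrative):

def error_space_pseudo_dimension(n_vars, max_degree):
    # Pseudo dimension of an error space G of n_vars-variable multinomials
    # in which each variable has degree at most max_degree.  By Lemma 6 it
    # equals the dimension of the vector space, (max_degree + 1) ** n_vars,
    # and by Lemma 7 the perturbed set {p + g | g in G} has the same pseudo
    # dimension, whatever function p the program computes.
    return (max_degree + 1) ** n_vars

# A program with 2 inputs whose error is assumed to be a multinomial of
# per-variable degree at most 3: the pseudo dimension is 16.  This value
# can be substituted for dim_P(P) in inequality (24).
print(error_space_pseudo_dimension(2, 3))  # prints 16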

4. CONCLUDING REMARKS

4.1. Summary of the results
This paper investigated the foundation of software testing by interpreting axioms of test adequacy as properties of inductive inference. A model of Weyuker's axiom system of program-based adequacy criteria was constructed on the basis of Gold's identification in the limit. The conservative and simplest-hypothesis nature of the inductive inference underlying software testing was manifested. As a result of the formal interpretation using the model of identification in the limit, the convergence of the inductive inference process underlying software testing was proved to be a condition for the correctness of adequately tested software. It was also proved that if both the program and its specification are learnable, the correctness of the software can be tested on a finite number of test cases. As an application of the results, the regularity testing hypothesis of Bougé et al. (1986) and Bernot et al. (1991) has been validated.

A model of Zhu and Hall's axiom system of adequacy measurements was constructed on the basis of PAC inductive inference. In the PAC inductive inference model, the measurement of the convergence of an inductive inference process was used to interpret test adequacy measurements. It was proved that adequacy measurement is closely related to the probable reliability of adequately tested software. Here, the δ-probable reliability of a software system is ρ if, with likelihood at least δ, the probability of failure on demand of the software is at most ρ. Probable reliability is an extension of Hamlet's probable correctness. A new approach to estimating software reliability based on random testing was proposed that takes software complexity into account rather than treating the software as a black box.

There are two practical implications of the results of this paper. First, software testing should take software complexity into consideration. In both models, the convergence of the inductive inference process underlying software testing may depend on the complexity of the program under test. In the model of identification in the limit, the dependence of test size on software complexity takes the form of a learning complexity function, while in the PAC model it takes the form of pseudo dimension. Generally speaking, the more complex the software is, the more test cases should be used. In software testing practice, program complexity has only partly been taken into account by various structural testing methods, and not at all by random testing. In this paper, it has been shown how complexity can be involved in random testing.

Secondly, the results of this paper suggest that when a tested software system is asserted to be correct, or when the reliability of the software is estimated, it is necessary to determine a set of functions of which the software can be considered a member, and to ask whether the set of functions can be learned by an inductive inference machine and whether the test set is sufficient to learn the particular software as an element of the set. Such a set of functions fixes the context of the inductive inference underlying software testing. This is similar to what fault-based testing methods do; mutation testing, for example, measures test adequacy according to whether a test set kills a set of mutants of the program under test. For other existing structural testing methods and for random testing, however, there is no such explicit context. In this paper, it has also been shown how the idea of perturbation testing can be applied to random testing.

4.2. Related work and further work

4.2.1. The relationship between test adequacy and inductive inference
The similarity between the activities in software testing and inductive inference is obvious. However, only a few papers in the literature have systematically studied the relationship between them. A survey and review of research on the relationship between these two subjects has been undertaken by Zhu et al. (1992).

Weyuker (1983) proposed an adequacy criterion explicitly involving inductive inference. The criterion was based on program synthesis, which is a particular form of inductive inference aimed at generating programs from sequences of input/output pairs. It required that the program and the specification should be successfully synthesized from an adequate test set. The criterion defined in Section 2 to interpret Weyuker's axiom system is different from Weyuker's criterion: in its definition, the specification is considered as the rule to learn, and a test set is adequate if the program can be generated as a hypothesis. This makes it an adequacy criterion independent of the particular form of the specification. Hence, it is suitable for interpreting axioms of program-based adequacy criteria. Of course, an adequacy criterion can also be defined by considering the program as the rule to learn and the specification as the hypothesis to be generated. Such a criterion would enable one to interpret axioms of specification-based adequacy criteria. However, there are few such axioms (Zhu and Hall, 1992).

Another approach to defining adequacy criteria is to consider equivalence between the software and its specification as a concept to learn. A test set t is adequate if software correctness can be inductively inferred from t. This seems related to Hamlet’s software probable correctness (Hamlet, 1987). As mentioned above, probable reliability is an extension of probable correctness. A more thorough investigation of the properties of probable reliability seems very promising.

4.2.2. The relationship between testability and learnability
There are some interesting papers on the relationship between testability and learnability. Cherniavsky and Smith (1987) defined a class of programs to be testable if there exists a function that generates a finite set of test cases on which the program under test can be distinguished from all other programs in the class. This definition of testability stems from mutation test adequacy criteria, and has been investigated by Budd and Angluin (1982). Cherniavsky and Smith compared testability with learnability in Gold's model of identification in the limit. They proved that testability and learnability are incomparable; in some cases, learning may be easier than testing. However, the results in this paper suggest that testing is easier than learning, which seems more intuitively and conceptually acceptable.

Page 26: A Formal Interpretation of Software Testing as Inductive Inference

28 HONG ZHU

Also inspired by Gold's identification in the limit, Cherniavsky and Statman (1988) extended the above notion of testability and introduced the notions of fixed-time testability, finite-time testability, and testability in the limit. Testability in the limit is the counterpart of Gold's identification in the limit. A problem for further work is whether testability in the limit is comparable with learnability by identification in the limit. The work in Section 2 may be applicable to this problem.

For {0, 1}-valued functions, a set of functions is PAC learnable if and only if it has a finite Vapnik-Chervonenkis dimension (VC dimension) (Blumer et al., 1989). Based on the theory of PAC inductive inference, Romanik (1992) proposed an interesting theory of approximate testing. A class of objects was defined to be approximately testable if, given any target object in the class and any error bound, a finite test set can be specified such that any other object in the class consistent with the target on the test set is within the error bound. Romanik studied the relationship between VC dimension and testability. She proved that a class of objects is testable if it has finite VC dimension and has a countable dense approximation. Since in this paper programs are computable functions defined on a countable domain, Romanik's result seems in some sense equivalent to Theorem 7 of this paper. Romanik also proposed the notion of Vapnik-Chervonenkis program dimension (VCP dimension), which is an extension of VC dimension. Romanik (1992) and Romanik and Vitter (1993) established the relationship between VCP dimension and approximate testability, so that the VCP dimension can be considered as a testing complexity measurement. Interesting results on the VCP dimensions of straight-line programs, if-then-else programs and loop programs were obtained. The relationship between pseudo dimension and VCP dimension was also discussed, so that Haussler's work could be applied. The work of Romanik and her colleague was done partly in parallel with the work of Section 3 of this paper. While they attempted to introduce novel notions of software testing, this paper focuses on interpreting the general principles (axioms) of software testing. Another difference of this work from that of Romanik is that pseudo dimension and its properties have been directly employed here in reliability estimation. However, through the use of pseudo dimension and VC dimension, both threads of work lead to the same relationship between learnability and testing. The work reported in this paper can be considered as a validation of the notions introduced by Romanik.

4.2.3. The relationship between reliability, test adequacy and random testing

A number of researchers have investigated statistical estimation of software reliability based on the results of random testing. For example, Howden (1987) and Parnas et al. (1990) proposed estimations of software reliability based on the total number of random test cases and the number of test cases on which the software fails. Miller et al. (1992) proposed a theory of software reliability estimation applicable to both black box random testing and sub-domain testing.

Duran and Ntafos (1984) compared the fault-detecting ability of random testing and structural testing, and surprisingly discovered that random testing is as effective as structural testing in detecting faults. Considering this result counter-intuitive, Hamlet and Taylor (1990) repeated the research more extensively and arrived at more precise statements about the relationship between partition probability, failure rate and test effectiveness; however, their results corroborated those of Duran and Ntafos. Recently, in a comparison of random testing with structural testing, Tsoukalas et al. (1993) considered the situation where the cost of testing is significant. Their conclusion was that, in that case, structural testing is superior to random testing.

This paper examines the foundation of test adequacy criteria rather than specific estimations of software reliability. The approach to estimating probable reliability outlined in Section 3.3 differs from existing methods of software reliability estimation in taking software complexity into account. However, the constants in inequality (24) are not yet tight enough for practical use; further research in this direction is necessary.

Acknowledgements A part of this work was done while the author was with the Department of Computing, The Open University, U.K. The author is most grateful to Professor P. A. V. Hall, Dr J. H. R. May and Dr Lingzi Jin. The author would also like to thank the anonymous referees for their comments on an earlier version of the paper.

References
Angluin, D. and Smith, C. H. (1987) 'Inductive inference', in Shapiro, S. (ed.), Encyclopedia of Artificial Intelligence, Vol. 1, Wiley Interscience, New York, U.S.A., pp. 409-418.
Barendregt, H. and Nipkow, T. (eds) (1994) Types for Proofs and Programs, Lecture Notes in Computer Science, 806, Springer-Verlag.
Bernot, G., Gaudel, M. C. and Marre, B. (1991) 'Software testing based on formal specifications: a theory and a tool', Software Engineering Journal, 6(6), 387-405.
Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M. K. (1989) 'Learnability and the Vapnik-Chervonenkis dimension', Journal of the ACM, 36(4), 929-965.
Bougé, L., Choquet, N., Fribourg, L. and Gaudel, M.-C. (1986) 'Test sets generation from algebraic specifications using logic programming', The Journal of Systems and Software, 6(4), 343-360.
Budd, T. A. and Angluin, D. (1982) 'Two notions of correctness and their relation to testing', Acta Informatica, 18(1), 31-45.
Case, J. and Smith, C. (1983) 'Comparison of identification criteria for machine inductive inference', Theoretical Computer Science, 25(2), 193-220.
Chang, C. C. and Keisler, H. J. (1973) Model Theory, North-Holland, Amsterdam, The Netherlands.
Cherniavsky, J. C. and Smith, C. H. (1987) 'A recursion theoretic approach to program testing', IEEE Transactions on Software Engineering, 13(7), 777-784.
Cherniavsky, J. C. and Statman, R. (1988) 'Testing: an abstract approach', in Proceedings of the Second Workshop on Software Testing, Verification and Analysis, Banff, Canada, IEEE Computer Society Press, Los Alamitos, California, U.S.A., pp. 38-44.
DeMillo, R. A., Lipton, R. J. and Sayward, F. G. (1978) 'Hints on test data selection: help for the practising programmer', IEEE Computer, 11(4), 34-41.
Dudley, R. M. (1978) 'Central limit theorems for empirical measures', The Annals of Probability, 6(6), 899-929.
Duran, J. W. and Ntafos, S. (1984) 'An evaluation of random testing', IEEE Transactions on Software Engineering, 10(4), 438-444.
Frankl, P. G. and Weyuker, E. J. (1988) 'An applicable family of data flow testing criteria', IEEE Transactions on Software Engineering, 14(10), 1483-1498.
Gold, E. M. (1967) 'Language identification in the limit', Information and Control, 10(5), 447-474.
Goodenough, J. B. and Gerhart, S. L. (1975) 'Toward a theory of test data selection', IEEE Transactions on Software Engineering, 1(2), 156-173.
Goodenough, J. B. and Gerhart, S. L. (1977) 'Toward a theory of testing: data selection criteria', in Yeh, R. T. (ed.), Current Trends in Programming Methodology, Vol. 2, Prentice Hall, Englewood Cliffs, New Jersey, U.S.A., pp. 44-79.
Hamlet, R. G. (1987) 'Probable correctness theory', Information Processing Letters, 25(1), 17-25.
Hamlet, R. G. and Taylor, R. (1990) 'Partition testing does not inspire confidence', IEEE Transactions on Software Engineering, 16(12), 1402-1411.
Haussler, D. (1992) 'Decision theoretic generalizations of the PAC model for neural net and other learning applications', Information and Computation, 100(1), 78-150.
Howden, W. E. (1976) 'Reliability of the path analysis testing strategy', IEEE Transactions on Software Engineering, 2(3), 208-215.
Howden, W. E. (1982) 'Weak mutation testing and completeness of test sets', IEEE Transactions on Software Engineering, 8(4), 371-379.
Howden, W. E. (1987) Functional Program Testing and Analysis, McGraw-Hill, New York, U.S.A.
Hutchinson, A. (1994) Algorithmic Learning, Oxford University Press, Oxford, U.K.
Laski, J. and Korel, B. (1983) 'A data flow oriented program testing strategy', IEEE Transactions on Software Engineering, 9(3), 347-354.
Miller, W. M., Morell, L. J., Noonan, R. E., Park, S. K., Nicol, D. M., Murrill, B. W. and Voas, J. M. (1992) 'Estimating the probability of failure when testing reveals no failures', IEEE Transactions on Software Engineering, 18(1), 33-43.
Ntafos, S. C. (1984) 'On required element testing', IEEE Transactions on Software Engineering, 10(6), 795-803.
Parnas, D. L., van Schouwen, A. J. and Kwan, S. P. (1990) 'Evaluation of safety-critical software', Communications of the ACM, 33(6), 636-648.
Parrish, A. S. and Zweben, S. H. (1991) 'Analysis and refinement of software test data adequacy properties', IEEE Transactions on Software Engineering, 17(6), 565-581.
Parrish, A. S. and Zweben, S. H. (1993) 'Clarifying some fundamental concepts in software testing', IEEE Transactions on Software Engineering, 19(7), 742-746.
Rapps, S. and Weyuker, E. J. (1985) 'Selecting software test data using data flow information', IEEE Transactions on Software Engineering, 11(4), 367-375.
Romanik, K. A. (1992) 'Approximate testing theory', Ph.D. dissertation, University of Maryland, College Park, Maryland, U.S.A., Computer Science Technical Report Series CS-TR-2988, UMIACS-TR-92-121.
Romanik, K. A. and Vitter, J. S. (1993) 'Using Vapnik-Chervonenkis dimension to analyse the testing complexity of program segments', Technical Report SOCS-93.7, School of Computer Science, McGill University, Canada.
Smith, C. H. and Angluin, D. (1983) 'Inductive inference: theory and methods', Computing Surveys, 15(3), 235-269.
Tsoukalas, M. Z., Duran, J. W. and Ntafos, S. C. (1993) 'On some reliability estimation problems in random and partition testing', IEEE Transactions on Software Engineering, 19(7), 687-697.
Valiant, L. G. (1984) 'A theory of the learnable', Communications of the ACM, 27(11), 1134-1142.
Wenocur, R. S. and Dudley, R. M. (1981) 'Some special Vapnik-Chervonenkis classes', Discrete Mathematics, 33, 313-318.
Weyuker, E. J. (1983) 'Assessing test data adequacy through program inference', ACM Transactions on Programming Languages and Systems, 5(4), 641-655.
Weyuker, E. J. (1986) 'Axiomatizing software test data adequacy', IEEE Transactions on Software Engineering, 12(12), 1128-1138.
Weyuker, E. J. (1988) 'The evaluation of program-based software test data adequacy criteria', Communications of the ACM, 31(6), 668-675.
Woodward, M. R. (1993) 'Errors in algebraic specifications and an experimental mutation testing tool', Software Engineering Journal, 8(4), 211-224.
Woodward, M. R. and Halewood, K. (1988) 'From weak to strong, dead or alive? An analysis of some mutation testing issues', in Proceedings of the Second Workshop on Software Testing, Verification and Analysis, Banff, Canada, IEEE Computer Society Press, Los Alamitos, California, U.S.A., pp. 152-158.
Zeil, S. J. (1983) 'Testing for perturbations of program statements', IEEE Transactions on Software Engineering, 9(3), 335-346.
Zhu, H. and Hall, P. A. V. (1992) 'Test data adequacy with respect to specifications and related properties', Technical Report No. 92/06, Department of Computing, The Open University, Milton Keynes, U.K.
Zhu, H. and Hall, P. A. V. (1993) 'Test data adequacy measurement', Software Engineering Journal, 8(1), 21-30.
Zhu, H., Hall, P. A. V. and May, J. H. R. (1992) 'Inductive inference and software testing', Software Testing, Verification and Reliability, 2(2), 69-82.
Zhu, H., Hall, P. A. V. and May, J. H. R. (1994) 'Software test coverage and adequacy', Technical Report No. 94/15, Department of Computing, The Open University, Milton Keynes, U.K.
Zhu, H., Hall, P. A. V. and May, J. H. R. (1995) 'Understanding software test adequacy: an axiomatic and measurement theory approach', in Mitchell, C. and Stavridou, V. (eds), Mathematics of Dependable Systems, Clarendon Press, Oxford, U.K., pp. 275-295.
Zhu, H. and Jin, L. (1991) 'A knowledge-based approach to program synthesis from examples', Journal of Computer Science and Technology, 6(1), 47-58.
Zweben, S. H. and Gourlay, J. S. (1989) 'On the adequacy of Weyuker's test data adequacy axioms', IEEE Transactions on Software Engineering, 15(4), 496-501.