The grand game of testing and some related issues
Yuri Gurevich, Research in Software Engineering, Microsoft
(posted 19-Dec-2015)


Agenda

1. Introducing the grand game
2. Various dimensions of testing
3. Myers’s paradox
4. Zoology of software testing
5. Remarks
6. Beware of statistics
7. Testing vs. scientific experimentation


Testing is rather important. Here is a proof by authority:

Bill Gates at MIT in 1996: “Testing is another area where I have to say I’m a little bit disappointed in the lack of progress. At Microsoft, in a typical development group, there are many more testers than there are engineers writing code. Yet Engineers spend well over a third of their time doing testing type work. You could say that we spend more time testing than we do writing code. And if you go back through the history of large-scale systems, that’s the way they’ve been.”


1. Introducing the grand game


In-class exercise

Scenario. A given program reads three integer values from a card. The three values are interpreted as representing the lengths of the sides of a triangle. The program prints a message that states whether the triangle is scalene, isosceles or equilateral.

Task. Write down particular inputs to the program in order to test the program as thoroughly as possible.

From “The Art of Software Testing” by Glenford J. Myers, Wiley 1979.
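A minimal sketch of such a program, in Python (the function name, the sorted-sides check, and the "invalid" result are my assumptions; Myers's spec leaves the behavior on wrong inputs open):

```python
def classify_triangle(a, b, c):
    """Classify side lengths as 'scalene', 'isosceles', or 'equilateral'.

    Returns 'invalid' for inputs that do not form a proper triangle:
    a non-positive side, or the sum of two sides not exceeding the third.
    """
    sides = sorted([a, b, c])
    if sides[0] <= 0 or sides[0] + sides[1] <= sides[2]:
        return "invalid"
    distinct = len(set(sides))
    if distinct == 3:
        return "scalene"
    if distinct == 2:
        return "isosceles"
    return "equilateral"

# A few of the inputs the exercise asks for:
print(classify_triangle(3, 4, 5))  # scalene
print(classify_triangle(2, 2, 3))  # isosceles
print(classify_triangle(2, 2, 2))  # equilateral
print(classify_triangle(1, 2, 3))  # invalid: sum of two sides = third
print(classify_triangle(0, 4, 5))  # invalid: zero-length side
```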


Myers: Do you have these?

1. a scalene triangle
2. an isosceles triangle
3. an equilateral triangle
4. 3 permutations of 2
5. a zero-length side
6. a negative-length side
7. three positive sides, sum of two = third
8. 3 permutations of 7
9. three positive sides, sum of two < third
10. 3 permutations of 9
11. (0, 0, 0)
12. noninteger values
13. wrong number of initial values
14. the expected output in each case

Comments

Additional test ideas:
  real inputs
  Boolean or character inputs
  excessively large inputs

Critiquing the spec:
  Are degenerate triangles legal? It seems not.
  Is 3.0 an integer? It seems not.
  Why are the inputs integers? Naturally they should be reals.
    Probably to simplify the problem and avoid rounding issues.
  Most importantly, what should the program do on wrong inputs?

But of course finding spec deficiencies may be part of testing.

Why this triangle problem? Myers assumes that the reader has the necessary knowledge, so the problem can be formulated easily.

Some alternatives: Euclid’s algorithm, integer multiplication.


What programs do on wrong inputs

The description of the expected result is a necessary part of a test.

No expectations, no surprises.

More generally, test what the program is supposed to do as well as what it is not supposed to do.

An electronic trading program is supposed to perform trades correctly, but this is not enough. It is not supposed to round prices and deposit the difference into the programmer’s account.
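As a small, hypothetical illustration (the function and its behavior are mine, not from the talk), a test should pin down forbidden behavior as well as required behavior — here, that splitting a payment never loses or creates a cent:

```python
def split_payment(total_cents, n):
    """Split a payment of total_cents among n parties, in whole cents.

    Earlier parties may receive one extra cent, so that the shares
    always add up exactly to the total -- no cent goes anywhere else.
    """
    base, extra = divmod(total_cents, n)
    return [base + 1 if i < extra else base for i in range(n)]

shares = split_payment(1000, 3)
print(shares)               # [334, 333, 333]
print(sum(shares) == 1000)  # True: nothing is skimmed off
```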


The grand game

Programmer vs. Tester: individuals, teams, or organizations

The game may be cooperative but the players have different goals. Tester’s goal is to find errors.

Successful tests discover errors. Cf. a positive diagnosis in medicine.

It is preferable that Programmer and Tester are distinct.

What if the code does not have any errors? Can you find all the errors? How is the winner determined?

The game is not necessarily antagonistic.

What is a move? When is the game over? There are no clear-cut answers. This is a matter of engineering and economics.


2. Various dimensions of testing


A partial taxonomy

Black box; white (or glass) box; grey box
Unit (or module), feature; integration; system
Alpha (in house), beta (by end users)
Poking around; manual; automated script; model based
Acceptance; usability
Functional; compatibility; regression; performance: load, stress
Comparison; install, uninstall; recovery
Combinatorial
Incremental
Sanity, smoke
Random
Security, privacy, compliance
Conformance (of implementation to the spec, or the other way round)


Black box testing

This is functional testing, also known as data-driven or input-output testing, and it is conformance testing: a spec is given, and you try to find inputs on which the program disagrees with the spec. Exhaustive testing is usually impossible or infeasible.

Think of testing a compiler or OS.

So when to stop testing? It is all about economics.


White box testing

The test data is derived from the program (and too often at the neglect of the spec). What is the analog of exhaustive black-box testing? Program coverage? No; a better answer is exhaustive path testing.


Exhaustive path testing

Normally infeasible
Does not guarantee compliance with the spec
The missing-path problem
Possibly missing data-sensitive errors:

  if (a − b) < ε  rather than  |a − b| < ε
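To make the slip concrete (the epsilon and the inputs are my choice), here is a sketch of the data-sensitive bug: the version without the absolute value agrees with the correct one on many inputs, including inputs that cover both of its paths, yet is wrong whenever a is much smaller than b:

```python
EPS = 1e-9

def close_buggy(a, b):
    # Data-sensitive bug: a - b is hugely negative when a < b,
    # so the comparison wrongly succeeds.
    return (a - b) < EPS

def close_correct(a, b):
    return abs(a - b) < EPS

# Both versions agree on these inputs...
print(close_buggy(1.0, 1.0), close_correct(1.0, 1.0))      # True True
print(close_buggy(2.0, 1.0), close_correct(2.0, 1.0))      # False False
# ...but exhaustive path testing of close_buggy can still miss this:
print(close_buggy(1.0, 100.0), close_correct(1.0, 100.0))  # True False
```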


3. Myers’s paradox

Myers: “The number of uncovered bugs in a program section is proportional to the number of discovered bugs in the section.”

The observation might have been known earlier, but we have no relevant references. In fact, the observation seems to have been overlooked by later investigators of reliability.


Is the paradox true or false?

Here is a counterexample with two one-line sections:
  Section 1: sum(1,1) := 3
  Section 2: sum(1,2) := 2
So the paradox is not literally true. Is it false?


From sections to modules

Due to the explosion in the size and complexity of software products, it is more appropriate these days to speak about modules rather than program sections. What about component interaction?


One analogy

“A bird in the hand is worth two in the bush.” Is that literally true? The value of any bird in any hand is at least the combined value of any two birds in any bush? Ridiculous! But is the saying false? No, people say it for a good reason.

“A sinitza (tit) in the hands is better than a zhuravl (crane) in the sky” is different, more specific, and yet the meaning is the same.

We do not evaluate the paradox on the true/false scale. Another scale runs from “hogwash” to “brilliant.”


One meaning of the paradox

Beware: whatever you have done to debug a module, and however you measure your debugging effort, there is no guarantee that the module is well debugged.

Here is a good “beware” article from 20 years later: “A Critique of Software Defect Prediction Models,” by Norman E. Fenton and Martin Neil, IEEE TSE, 1999.

From the abstract: “Many organizations want to predict the number of defects (faults) in software systems, before they are deployed, to gauge the likely delivered quality and maintenance effort. To help in this, numerous software metrics and statistical models have been developed, with a correspondingly large literature. We provide a critical review of this literature and the state-of-the-art.”


A stronger meaning of the paradox

Sriram Biyani and P. Santhanam, “Exploring Defect Data from Development and Customer Usage on Software Modules over Multiple Releases,” ISSRE 1998, International Symposium on Software Reliability Engineering.

“We have shown that: modules with more defects in development are likely to have more defects in the field. ...”


Deriving Myers’s paradox

The paradox, in its literal form, can actually be derived from assumptions that are strong but not completely ridiculous. We give two illustrations.


Example 1: Equally effective testing

Assume that two program sections (or modules) have been tested with the same effectiveness: the fraction of bugs caught by testing is the same, say 1/k, for both sections. Then, if b bugs are found in a section, there are kb bugs there altogether, of which

  kb − b = b(k − 1)

bugs remain undiscovered.
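A quick numerical check of the derivation, with k and b chosen arbitrarily:

```python
k = 10  # testing catches 1/k of the bugs in the section
b = 7   # bugs actually found in the section

total = k * b          # bugs in the section altogether
remaining = total - b  # = b * (k - 1)
print(total, remaining)  # 70 63

# The more bugs found, the more remain undiscovered: the paradox.
assert remaining == b * (k - 1)
```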


Example 2: Uniform distribution of bugs

Modules A and B have 10,000 lines of code each. A has 1000 bugs; B has 500. In both cases the bugs are uniformly distributed. A perfect inspection of 1000 lines of code reveals about 100 bugs in A and about 50 bugs in B: twice as many bugs are found in A. And A has twice as many remaining bugs: 900 vs. 450.
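The counts on the slide are simple proportions; a sketch using its numbers:

```python
lines = 10_000
inspected = 1_000
bugs_A, bugs_B = 1_000, 500

# With uniformly distributed bugs, a perfect inspection of a fraction
# of the lines is expected to reveal the same fraction of the bugs.
found_A = inspected * bugs_A // lines  # 100
found_B = inspected * bugs_B // lines  # 50
print(found_A, found_B)                    # 100 50
print(bugs_A - found_A, bugs_B - found_B)  # 900 450 remaining
```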


4. Zoology of software testing


Counting bugs

The goal of testing is often formulated thus: find bugs. This is not unreasonable, as bugs are always there. (The New York Times example.)

But bug counting is harder than one may think. Think of a high-school essay.

Programming languages are less expressive, but (a) the logic may be more involved and (b) the computer is more demanding detail-wise than a human interlocutor.

Think 1 + f(x).

Where is the bug?

The story of a merchant with two enemies: one poisons the water, and the other makes a hole in the container.

Take-home exercise. Program a function f so that some value of f is wrong and there are two different lines i and j of code such that changing line i fixes the bug, changing line j fixes the bug, but making both changes brings the bug back.
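One way to solve the exercise (a deliberately artificial sketch of mine, not from the talk): make the result depend on two flags so that flipping either line alone fixes f, while flipping both restores the wrong value — the two "enemies" cancel each other out:

```python
def f(x):
    a = 0  # line i: "fixing" means changing this to 1
    b = 0  # line j: "fixing" means changing this to 1
    return x + (a ^ b)  # the correct result is x + 1

print(f(5))  # 5, but 6 was expected: the bug

# Change line i alone (a = 1):  x + (1 ^ 0) = x + 1  -> fixed
# Change line j alone (b = 1):  x + (0 ^ 1) = x + 1  -> fixed
# Change both (a = 1, b = 1):   x + (1 ^ 1) = x      -> the bug is back
```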


What is a bug?

That is a huge issue, and we are not going there. Just a couple of remarks:
  taxonomies of bugs
  triage
  one man’s bug may be another man’s feature

5. Remarks


One advantage of human inspection

Test automation is key, of course, but notice this: human inspection detects actual errors rather than their symptoms.


Testing: a good research area

Testing is much needed. The theory is still maturing, and thus entry into the field is still relatively easy.

The field is huge and diverse. There is plenty of room for engineering ingenuity, and there is plenty of room for mathematical analysis.

As we will see, testing is a game in more senses than one.

PHILOSOPHICAL REMARKS


6. Beware of statistics: Simpson’s paradox

A university is trying out two automated diagnostic tools, A and B. Students with computer problems can dial either tool; if the tool fails to solve the problem, a human takes the call. Is the following possible?
  Tool B works better for PCs.
  Tool B works better for Macs.
  Tool A works better overall.


PC      S(uccess)   F(ailure)   Total   % of S
Tool A      18          12        30      60
Tool B       7           3        10      70

Mac     S(uccess)   F(ailure)   Total   % of S
Tool A       2           8        10      20
Tool B       9          21        30      30

Both    S(uccess)   F(ailure)   Total   % of S
Tool A      20          20        40      50
Tool B      16          24        40      40
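The three tables can be checked mechanically:

```python
# (successes, failures) per tool, from the tables above
pc  = {"A": (18, 12), "B": (7, 3)}
mac = {"A": (2, 8),   "B": (9, 21)}

def rate(s, f):
    # Success percentage
    return 100 * s / (s + f)

for tool in ("A", "B"):
    s = pc[tool][0] + mac[tool][0]
    f = pc[tool][1] + mac[tool][1]
    print(tool, rate(*pc[tool]), rate(*mac[tool]), rate(s, f))
# A 60.0 20.0 50.0
# B 70.0 30.0 40.0
# B wins on PCs and on Macs, yet A wins overall: Simpson's paradox.
```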

7. SOFTWARE TESTING AND SCIENTIFIC EXPERIMENTATION


Testing is no verification

Testing and experimentation find faults; neither proves correctness. But the evidence may mount. This brings us to the problem of inductive reasoning: we have observed the sun rising in the east many times; hence it always rises in the east.


One induction joke

Is every odd number prime?
  Engineer: 3 is prime, 5 is prime, 7 is prime, 9 is prime, 11 is prime, …
  Physicist: 3 is prime, 5 is prime, 7 is prime, 9 is experimental error, 11 is prime, 13 is prime, …

But induction itself is no joke:
  induction in the natural sciences
  inductive reasoning about the employment of software


The problem of induction

Hume: you can check only finitely many of the infinitely many instances. How do you justify the jump to the conclusion? Is it even useful to bother checking more instances? There have been numerous attempts to justify induction, e.g. probabilistically.


Popper

The justification of induction is so hard because it is impossible. Any scientific theory is just that, a theory. (Example: from Newton to Einstein.)

On the other hand, scientific theories can be falsified; there is no symmetry between verification and falsification. Additional observations may carry no additional weight (the water-boiling example). You have to challenge the theory.


Falsification principle

Falsifiability is the demarcation line between science, like physics and chemistry, and pseudo-science, like astrology and Marxism.

The principle caught the imagination of scientists.


Falsifiability of the falsification principle: the logic approach

Self-application. Is the falsification principle scientific?

Existential claims ∃x φ(x) can in principle be verified, provided that φ(x) is observable.

Universal claims ∀x φ(x) can in principle be falsified, provided that φ is observable. But what if the claim is of the form ∀x ∃y φ(x, y), or …?

Typically scientific claims are universal, e.g. in Newtonian mechanics any two objects attract each other … Maybe this explains Popper’s popularity with the scientists.


Pragmatics

What really is the form of a scientific claim? There are many explanations of a failed experiment. The predicted discovery of Pluto was a triumph of Newtonian mechanics; what would have happened if Pluto had not been found?

Theories are remarkably resistant to falsification, especially if there is no good alternative theory.


Falsification principle and software

Is falsifiability a demarcation line between bad and good software? Can you falsify software? If yes, how many bugs does it take to do the trick?

Buggy software is fine until we have a better one. Like with theories.


Distinction

Scientific experimentation is harder: it takes a lot of ingenuity to master even one test. It is typically black box; genetic research is as close as you get to an exception in the natural sciences. And can you influence the “programmer”?
