
Transcript of Benchmarking Effectiveness

Page 1: Benchmarking Effectiveness

Benchmarking Effectiveness

for Object-Oriented Unit Testing

Anthony J H Simons and Christopher D Thomson

Page 2: Benchmarking Effectiveness

Overview

– Measuring testing?
– The Behavioural Response
– Measuring six test cases
– Evaluation of JUnit tests
– Evaluation of JWalk tests

http://www.dcs.shef.ac.uk/~ajhs/jwalk/

Page 3: Benchmarking Effectiveness

Analogy: Metrics and Testing

Things easy to measure (but why?)
– metrics: MIT O-O metrics (Chidamber & Kemerer)
– testing: decision-, path-, whatever-coverage
– testing: count exceptions, reduce test-set size

Properties you really want (but how?)
– metrics: Goal, Question, Metric (Basili et al.)
– testing: e.g. mutant-killing index
– testing: effectiveness and efficiency?

Page 4: Benchmarking Effectiveness

Measuring Testing?

Most approaches measure testing effort, rather than test effectiveness!

Page 5: Benchmarking Effectiveness

Degrees of Correctness

Suppose an ideal test set
– BR : behavioural response (set)
– T : tests to be evaluated (bag – duplicates?)
– TE = BR ∩ T : effective tests (set)
– TR = T – TE : redundant tests (bag)

Define test metrics
– Ef(T) = (|TE| – |TR|) / |BR| : effectiveness
– Ad(T) = |TE| / |BR| : adequacy
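Below is a minimal Java sketch of how these two metrics could be computed, assuming tests and responses are identified by plain strings; the class and helper names are illustrative and not part of the paper.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Illustrative sketch, not the authors' tool: T is a bag of test
    // identifiers (may contain duplicates), BR is the ideal response set.
    public class TestMetrics {

        // TE = BR ∩ T : the distinct required responses that T actually covers
        static Set<String> effective(Set<String> br, List<String> t) {
            Set<String> te = new HashSet<>(t);
            te.retainAll(br);
            return te;
        }

        // Ad(T) = |TE| / |BR|
        static double adequacy(Set<String> br, List<String> t) {
            return (double) effective(br, t).size() / br.size();
        }

        // Ef(T) = (|TE| - |TR|) / |BR|, with |TR| = |T| - |TE| (redundant tests)
        static double effectiveness(Set<String> br, List<String> t) {
            int te = effective(br, t).size();
            int tr = t.size() - te;
            return (double) (te - tr) / br.size();
        }
    }

Applied to the Stack1 row of the JUnit results later in the deck (|T| = 20, |TE| = 12, |TR| = 8, |BR| = 12), this gives Ad = 1.00 and Ef = (12 – 8)/12 = 0.33.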

Page 6: Benchmarking Effectiveness

Ideal Test Set?

The ideal test set must verify each distinct

response of an object!

Page 7: Benchmarking Effectiveness

What is a Response?

Input response
– Account.withdraw(int amount) : 3 partitions
  • amount < 0 → fail precondition, exception
  • amount > balance → refuse, no change
  • amount <= balance → succeed, debit

State response
– Stack.pop() : 2 states
  • isEmpty() → fail precondition, exception
  • !isEmpty() → succeed
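As a concrete illustration of the input response above, here is an assumed Java sketch of the Account example (not the actual benchmark class from the study); each branch realises one of the three partitions:

    // Assumed sketch of the Account example; field and exception choices
    // are illustrative, not taken from the slides.
    public class Account {
        private int balance;

        public void withdraw(int amount) {
            if (amount < 0) {
                // partition 1: fail precondition, exception
                throw new IllegalArgumentException("negative amount");
            } else if (amount > balance) {
                // partition 2: refuse, no change
                return;
            } else {
                // partition 3: succeed, debit
                balance -= amount;
            }
        }
    }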

Page 8: Benchmarking Effectiveness

Behavioural Response – 1

Input response
– c.f. exemplars of equivalence partitions
– max responses per method, over all states

State response
– c.f. state cover, to reach all states
– max state-contingent responses, over all methods

Behavioural Response
– product of input and state response
– checks all argument partitions in all states
– c.f. transition cover augmented by exemplars
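For example, using the figures from the Behavioural Response table later in the deck: Stack2 has an input response of 7 and a state response of 3, so BR(1,1) = 7 × 3 = 21 distinct responses to verify.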

Page 9: Benchmarking Effectiveness

Behavioural Response – 2

Parametric form: BR(x, y)
– stronger ideal sets, for higher x, y
– x = length of sequences from each state
– y = number of exemplars for each partition

Redundant states
– higher x rules out faults hiding in duplicated states

Boundary values
– higher y verifies equivalence partition boundaries

Useful measure
– precise quantification of what has been tested
– repeatable guarantees of quality after testing
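As an illustrative reading of the parameters (not a figure from the slides): under BR(1,2), each of Account.withdraw's three input partitions would need two exemplars rather than one, so the second exemplar can be placed on a partition boundary such as amount == balance.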

Page 10: Benchmarking Effectiveness

Compare Testing Methods

JWalk – “Lazy systematic unit testing method”

JUnit – “Expert manual unit testing method”

Page 11: Benchmarking Effectiveness

JUnit – Beck, Gamma

“Automates testing”
– manual test authoring (as good as human expertise)
– may focus on positive, miss negative test cases
– saved tests automatically re-executed on demand
– regression style may mask hard interleaved cases

Test harness
– bias: test method “testX” for each method “X”
– each “testX” contains n assertions = n test cases
– same assertions appear redundantly in “testY”, “testZ”
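A hedged JUnit 3 style sketch of the harness bias described above; the Stack class and its methods (push, pop, top, isEmpty) are assumed for illustration and are not reproduced from the study's test suites:

    import junit.framework.TestCase;

    // Illustrative JUnit 3 style harness: one "testX" method per method "X",
    // several assertions per test method, and the same checks repeated
    // across test methods.
    public class StackTest extends TestCase {

        public void testPush() {
            Stack s = new Stack();
            s.push(1);
            assertFalse(s.isEmpty());   // assertion 1
            assertEquals(1, s.top());   // assertion 2
        }

        public void testPop() {
            Stack s = new Stack();
            s.push(1);
            assertEquals(1, s.top());   // repeats a check already made in testPush
            s.pop();
            assertTrue(s.isEmpty());    // the only genuinely new check here
        }
    }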

Page 12: Benchmarking Effectiveness

JWalk – Simons

Lazy specification
– static analysis of compiled code
– dynamic analysis of state model
– adapts to change, revises the state model

Systematic testing
– bounded exhaustive state-based exploration
– may not generate exemplars for all input partitions
– semi-automatic oracle construction (confirm key values)
– learns test equivalence classes (predictive testing)
– adapts existing oracles, superclass oracles

Page 13: Benchmarking Effectiveness

Six Test Cases

Stack1 – simple linked stack
Stack2 – bounded array stack
– change of implementation

Book1 – simple loanable book
Book2 – also with reservations
– extension by inheritance

Account1 – with deposit/withdraw
Account2 – with preconditions
– refinement of specification
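An assumed sketch of what the Stack2 case might look like (the real benchmark classes are not reproduced in the slides); the capacity bound presumably adds a third state, full, alongside empty and partly filled, which matches Stack2's state response of 3 in the next table:

    // Assumed illustration of Stack2, a bounded array stack; names and
    // exception types are guesses for the sketch only.
    public class Stack2 {
        private final Object[] items;
        private int size;

        public Stack2(int capacity) { items = new Object[capacity]; }

        public boolean isEmpty() { return size == 0; }
        public boolean isFull()  { return size == items.length; }

        public void push(Object item) {
            if (isFull()) throw new IllegalStateException("stack full");
            items[size++] = item;
        }

        public Object pop() {
            if (isEmpty()) throw new IllegalStateException("stack empty");
            return items[--size];
        }
    }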

Page 14: Benchmarking Effectiveness

Instructions to Testers

Test each response for each class, similar to the transition cover, but with all equivalence partitions for method inputs.

Page 15: Benchmarking Effectiveness

Behavioural Response

Test Class   API   Input R   State R   BR(1,1)
Stack1         6       6         2        12
Stack2         7       7         3        21
Book1          5       5         2        10
Book2          9      10         4        40
Account1       5       6         2        12
Account2       5       9         2        18

BR(1,1) is the ideal test target for each class.

Page 16: Benchmarking Effectiveness

JUnit – Expert Testing

Test Class     T    TE    TR   Ad(T)   Ef(T)    time
Stack1        20    12     8    1.00    0.33    11.31
Stack2        23    16     7    0.76    0.43   +14.00
Book1         31     9    22    0.90   -1.30    11.00
Book2        104    21    83    0.53   -1.55   +20.00
Account1      24    12    12    1.00    0.00    14.37
Account2      22    17     5    0.94    0.67    08.44

Massive over-generation of tests, yet still not effective.
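To see how a suite can be adequate yet ineffective, take the Book1 row: Ad(T) = 9/10 = 0.90, but Ef(T) = (9 – 22)/10 = –1.30, because the 22 redundant tests outweigh the 9 effective ones.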

Page 17: Benchmarking Effectiveness

JWalk – Test Generation

Test Class     T    TE    TR   Ad(T)   Ef(T)    time
Stack1        12    12     0    1.00    1.00     0.42
Stack2        21    21     0    1.00    1.00     0.50
Book1         10    10     0    1.00    1.00     0.30
Book2         36    36     0    0.90    0.90     0.46
Account1      12    12     0    1.00    1.00     1.17
Account2      17    17     0    0.94    0.94    16.10

No wasted tests; 5 input partitions missed in total (4 in Book2, 1 in Account2).

Page 18: Benchmarking Effectiveness

Comparisons

JUnit: expert manual testing
– massive over-generation of tests (w.r.t. goal)
– sometimes adequate, but not effective
– stronger (t2, t3); duplicated; and missed tests
– hopelessly inefficient – also debugging test suites!

JWalk: lazy systematic testing
– near-ideal coverage, adequate and effective
– a few input partitions missed (simple generation strategy)
– very efficient use of the tester’s time – sec. not min.
– or: two orders (x 1000) more tests, for same effort

Page 19: Benchmarking Effectiveness

Conclusion

Behavioural Response
– seems like a useful benchmark (scalable, flexible)
– use with formal, semi-formal, informal design methods
– measures effectiveness, rather than effort

Moral for testing
– don’t hype up automatic test (re-)execution
– need systematic test generation tools
– automate the parts that humans get wrong!

Page 20: Benchmarking Effectiveness

Any Questions?

http://www.dcs.shef.ac.uk/~ajhs/jwalk/

Put me to the test!