Learning near optimum inspection policies

Post on 09-Jul-2015


Description

High assurance software requires extensive and expensive assessment. There are many forms of software assessment, ranging from manual inspections to automatic formal methods. These assessment methods differ in their effectiveness and the effort required to apply them. Typically, the more effective methods are more expensive. Hence, project managers often "bias" the assessment resources and apply more effort where they think that extra effort might be most useful. If most of the assessment effort explores project artifacts A,B,C,D, then that leaves a "blind spot" in E,F,G,H,I,.... Blind spots can compromise high assurance software. It is therefore important to discuss the bias introduced by the inspection policy. To put the matter in a nutshell, we need to ask "how blinding is our bias?"

This talk contrasts three different kinds of "bias" in selecting which code modules to inspect: 1) manual methods such as "read the biggest thing first/last"; 2) traditional data mining methods such as those advocated by the author and those deployed in NASA-funded inspection tools; 3) a new data miner called "WHICH". We find that #1 usually outperforms #2. This result calls into question many years of research by the speaker (translation: "oh dear....."). But we also find that #3 almost always outperforms #1 or #2 (translation: "phew!!"). In fact, #3 works so well that we speculate it could be used as a proxy for determining the actual number of defects remaining to be found after inspecting Z% of the code.

ABOUT THE SPEAKER: Dr. Tim Menzies (tim@menzies.us) has been working on advanced modeling and AI since 1986. He received his PhD from the University of New South Wales, Sydney, Australia, and is the author of over 160 refereed papers. A former research chair for NASA, Dr. Menzies is now an associate professor at West Virginia University's Lane Department of Computer Science and Electrical Engineering. For more information, visit his web page at http://menzies.us.

Transcript of Learning near optimum inspection policies

1

Learning near optimum inspection policies

Tim@Menzies.us (WVU)
Zach Milton, WVU
Feb 5, 2008

2

The Briand Threshold

[Figure: % defective modules detected vs. % LOC read, rising to (100,100). Goal: stay over the threshold line.]

3

“Manual Up”: the Koru Hypothesis

[Figure: % defective modules detected vs. % LOC read, to (100,100); the "Manual" curve sits above the threshold line.]

Smaller modules have disproportionately more defects. If so, then we'll find more bugs sooner if we read "manualUp" (i.e. read the smallest modules first).

4

Optimum Detector

[Figure: % defective modules detected vs. % LOC read, to (100,100); the "optimal" curve rises above "Manual" and the threshold, reaching all defects after X% of the LOC.]

X% of the code is in defective modules. If some perfect oracle flags all the defective modules, then inspecting just those modules manualUp finds all the defects.

5

Sub-optimum, useful automatic detector

[Figure: as before, with a "useful" detector curve between "optimal" and "Manual"; it triggers on Y% of the LOC, where Y% > X%.]

Triggers on Y% of the code, not all of which is defective. Useful if it lies above both manual and the threshold.

6

Comparing two detectors

[Figure: % defective modules detected vs. % LOC read for the optimal curve, detector1, and detector2.]

• Report detector performance as area = AUC(detector) / AUC(optimal)
• 0 <= area <= 1 (larger is better)
• For 10 data sets, 10 randomizations of ordering, and 3-way hold-outs (66% train, 33% test): 300 numbers for each detector
• Compare with Mann-Whitney (99% confidence)
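To make the "area" measure concrete, here is a minimal sketch (not the talk's code; the curve format and function names are assumptions) of computing AUC(detector) / AUC(optimal):

```python
# Minimal sketch (assumed helper names, not the talk's code).
# A curve is a list of (% LOC read, % defective modules detected) points
# running from (0, 0) up to (100, 100).
import numpy as np

def auc(curve):
    """Trapezoidal area under an inspection curve."""
    xs, ys = zip(*sorted(curve))
    return np.trapz(ys, xs)

def area(detector_curve, optimal_curve):
    """Normalized performance: AUC(detector) / AUC(optimal), in [0, 1]."""
    return auc(detector_curve) / auc(optimal_curve)
```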

7

Technical details

[Figure: % defective modules detected vs. % LOC read for a detector; the trajectory from Y% read to 100% read is unknown.]

We don't know the trajectory from Y% read to 100% read. We'll make the most pessimistic assumption (so our real results are better than what we report below).

Other assumptions:
• All bugs are treated equally (no concept of defect severity).
• Inspections are some fixed % effective at recognizing defective modules; since we report the ratio of two AUC curves, this cancels out, so these results are independent of inspection effectiveness.

8

Three classes of detectors

• Manual methods
  – Manual up (inspect smallest modules first)
  – Manual down (inspect largest first)
• Traditional learners
  – J48, NaiveBayes, RIPPER
• A new learner
  – Different versions of WHICH
  – E.g. WHICH2loc discretizes the log of numbers into two bins and favors rules that select the least LOC
  – E.g. WHICH8 discretizes the log of numbers into 8 bins
• For each learner (see the sketch below)
  – Take the modules selected via learning
  – Sort them by LOC
  – Inspect them smallest to largest
  – Track when we stumble over a module with defects
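As a sketch of that last procedure (my reconstruction with assumed data structures, not the talk's code), the inspection curve for any learner can be traced like this:

```python
# Sketch of the evaluation loop above (assumed data structures, not the
# talk's code): inspect the flagged modules smallest-first and record the
# (% LOC read, % defective modules detected) curve.
def inspection_curve(flagged_modules, all_modules):
    """Each module is a (loc, is_defective) pair."""
    total_loc = sum(loc for loc, _ in all_modules)
    total_defective = max(1, sum(1 for _, bad in all_modules if bad))
    curve, loc_read, found = [(0.0, 0.0)], 0, 0
    for loc, bad in sorted(flagged_modules):      # smallest modules first
        loc_read += loc
        found += 1 if bad else 0
        curve.append((100.0 * loc_read / total_loc,
                      100.0 * found / total_defective))
    return curve
```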

9

What is WHICH?

• WHICH = our new idea
  – Technically: a stochastic best-first search (SBFS)
  – The implementation of this type of search is not done with a tree, but rather a stack
• Motto of WHICH: start as you mean to go on
  – If the learned theory is to be assessed via criteria “P”
  – Then use “P” at every step of growing and pruning the theory
• Note: standard learners
  – Grow/prune via criteria “Q”, then assess the learned theory via criteria “P”

10

The logic of WHICH

• If the red path in the above tree is a current rule that is scoring (via “P”) very well, and the blue path is another rule that is also scoring well, why add only one conjunction at a time?

• Instead, combine the two paths so far and see if that works out better.

• This essentially skips some of the rule growing and moves straight to a potentially more optimal solution.

11

WHICH Implementation

• Items in a stack are scored and sorted via criteria “P”.

• Two rules are then picked from the stack at random, biased by their scores, and combined.

• The new rule is then scored and placed back in the stack, in sorted order.

[Example: a stack of single ranges — outlook=overcast, humidity=high, rain=true, humidity=low, rain=false, ... — from which the combined rule outlook=overcast AND rain=true is built.]

12

WHICH Implementation (continued)

• New rules that score high have a better chance of being combined.

• This leads to bigger rules over time.

• The process is repeated until either
  – a total number of picks is reached,
  – or a criterion is met (an early stopping condition).

[Example: the stack now also holds combined rules such as outlook=overcast AND rain=true; a further combination yields humidity=high AND outlook=overcast AND rain=true.]

13

WHICH Summary

• WHICH initially creates a sorted stack of all attribute ranges in isolation.

• It then, based on score, randomly selects two rules from the stack, combines them, and places the new rule in the stack in sorted order.

• It continues to do this until a stopping criterion is met.

• WHICH supports both conjunctions and disjunctions.

• If the two rules selected contain different ranges of the same attribute, those ranges are OR'd together instead of AND'd.

[Example: combining outlook=sunny AND rain=true with outlook=overcast gives outlook = [sunny OR overcast] AND rain=true.]
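Putting the summary together, here is a minimal sketch of the WHICH loop (my reconstruction, not the authors' implementation; the rule representation and the score function standing in for “P” are assumptions):

```python
# Sketch of the WHICH loop described above (my reconstruction, not the
# authors' code). `score` stands in for criterion "P"; a rule maps each
# attribute to the set of ranges that are OR'd together.
import random

def which(ranges, score, max_picks=1000):
    """ranges: (attribute, value) pairs, e.g. ("outlook", "overcast")."""
    # 1. Seed the stack with every attribute range in isolation, sorted by P.
    stack = [{attr: {val}} for attr, val in ranges]
    stack.sort(key=score, reverse=True)
    for _ in range(max_picks):                    # or an early-stopping rule
        # 2. Pick two rules at random, biased towards higher (non-negative) scores.
        weights = [max(score(rule), 0.0) + 1e-6 for rule in stack]
        r1, r2 = random.choices(stack, weights=weights, k=2)
        # 3. Combine: same attribute -> OR the ranges; different -> AND.
        combined = {}
        for rule in (r1, r2):
            for attr, vals in rule.items():
                combined.setdefault(attr, set()).update(vals)
        # 4. Score the new rule and keep the stack in sorted order.
        if combined not in stack:
            stack.append(combined)
            stack.sort(key=score, reverse=True)
    return stack[0]                               # highest-scoring rule found
```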

14

Sample results

[Figure: inspection curves for WHICH2, manual down, manual up, and the other learners.]

But how representative are these results?

15

Results type #1: 5/8 examples

WHICH > manual > traditional

16

“areas” in CM1 (min, 25%, median, 75%, max):
  which2,      0.0, 57.4, 68.1, 71.5, 81.5
  manualUp,   48.3, 57.4, 59.8, 65.3, 71.5
  nBayes,     36.2, 46.0, 52.1, 59.1, 69.2
  manualDown, 33.6, 40.3, 47.6, 49.3, 60.2
  which8loc,   0.0,  0.0,  0.0,  0.0, 16.1
  which8,      0.0,  0.0, 11.4, 26.2, 35.6
  which4loc,   0.0,  0.0,  0.0,  0.0, 10.4
  which4,      0.0,  0.0,  0.0, 41.2, 69.0
  which2loc,   0.0,  0.0,  0.0,  0.0, 40.7
  jRip,        0.0,  0.0,  5.8, 11.5, 24.1
  j48,         0.0,  0.0,  0.1, 12.9, 33.3

key, ties, win, loss, win-loss @ 99%:
  which2,     1, 9, 0,  9
  manualUp,   1, 9, 0,  9
  nBayes,     0, 8, 2,  6
  manualDown, 0, 7, 3,  4
  which8,     3, 3, 4, -1
  which4,     3, 3, 4, -1
  jRip,       3, 3, 4, -1
  j48,        3, 3, 4, -1
  which8loc,  2, 0, 8, -8
  which4loc,  2, 0, 8, -8
  which2loc,  2, 0, 8, -8

1. Distributions of results

2. Statistical results comparing the distributions (which has the largest median ranked values?)
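For reference, a minimal sketch (assumed data layout, not the talk's code) of how such a win/tie/loss table can be tallied from the "area" scores with pairwise Mann-Whitney tests at 99% confidence:

```python
# Sketch (my reconstruction, not the talk's code) of the win/tie/loss tally:
# every learner is compared with every other learner via Mann-Whitney on its
# 'area' scores; undistinguishable pairs tie, else the larger median wins.
from itertools import combinations
from statistics import median
from scipy.stats import mannwhitneyu

def win_tie_loss(scores):
    """scores: dict mapping learner name -> list of 'area' values."""
    tally = {k: {"ties": 0, "win": 0, "loss": 0} for k in scores}
    for a, b in combinations(scores, 2):
        _, p = mannwhitneyu(scores[a], scores[b], alternative="two-sided")
        if p >= 0.01:                               # not distinguishable at 99%
            tally[a]["ties"] += 1; tally[b]["ties"] += 1
        elif median(scores[a]) > median(scores[b]):
            tally[a]["win"] += 1;  tally[b]["loss"] += 1
        else:
            tally[b]["win"] += 1;  tally[a]["loss"] += 1
    # rank by win - loss, as in the tables shown on these slides
    return sorted(tally.items(),
                  key=lambda kv: kv[1]["win"] - kv[1]["loss"], reverse=True)
```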

17

“areas” in KC1 (min, 25%, median, 75%, max):
  which2,     71.4, 73.8, 76.0, 78.0, 81.8
  manualUp,   64.5, 65.8, 67.6, 68.9, 70.0
  nBayes,     54.9, 60.2, 61.9, 63.0, 67.7
  which4,      0.0, 49.8, 52.9, 55.2, 60.5
  manualDown, 39.7, 42.2, 43.3, 45.2, 47.7
  j48,        11.6, 20.5, 27.8, 31.7, 40.1
  jRip,       10.2, 17.3, 21.3, 25.2, 32.4
  which8loc,   0.0,  0.0,  0.0,  1.0,  2.2
  which8,      0.0,  0.0,  0.0,  2.0, 33.9
  which4loc,   0.0,  0.0,  0.0,  0.0,  1.1
  which2loc,   0.0,  0.0,  0.0,  0.0,  2.1

key, ties, win, loss, win-loss @ 99%:
  which2,     0, 10, 0, 10
  manualUp,   0,  9, 1,  8
  nBayes,     0,  8, 2,  6
  which4,     0,  7, 3,  4
  manualDown, 0,  6, 4,  2
  j48,        0,  5, 5,  0
  jRip,       0,  4, 6, -2
  which8loc,  1,  2, 7, -5
  which8,     3,  0, 7, -7
  which4loc,  2,  0, 8, -8
  which2loc,  2,  0, 8, -8

18

“areas” in KC2 (min, 25%, median, 75%, max):
  which2,     65.6, 76.0, 81.6, 84.6, 88.5
  manualUp,   57.9, 65.4, 69.3, 71.0, 76.6
  nBayes,     47.0, 54.8, 58.7, 61.0, 69.4
  which4,     43.1, 52.5, 59.4, 66.8, 79.6
  manualDown, 37.9, 43.1, 46.1, 52.3, 62.4
  which8,     26.3, 36.5, 41.2, 47.6, 56.5
  j48,        26.0, 36.1, 41.2, 45.9, 59.8
  jRip,       22.2, 36.0, 42.2, 49.5, 65.2
  which8loc,   0.0,  0.0,  0.0,  0.0,  5.9
  which4loc,   0.0,  0.0,  0.0,  0.0,  2.9
  which2loc,   0.0,  0.0,  0.0,  0.0,  3.1

key, ties, win, loss, win-loss @ 99%:
  which2,     0, 10, 0, 10
  manualUp,   0,  9, 1,  8
  which4,     1,  7, 2,  5
  nBayes,     1,  7, 2,  5
  manualDown, 1,  5, 4,  1
  jRip,       3,  3, 4, -1
  which8,     2,  3, 5, -2
  j48,        2,  3, 5, -2
  which8loc,  2,  0, 8, -8
  which4loc,  2,  0, 8, -8
  which2loc,  2,  0, 8, -8

19

“areas” in MW1_mod (min, 25%, median, 75%, max):
  which2,     35.8, 57.4, 62.4, 70.8, 83.3
  manualDown, 42.8, 52.1, 60.2, 63.7, 71.8
  manualUp,   37.1, 44.0, 47.8, 51.9, 62.5
  which8,      0.1, 35.6, 39.3, 47.6, 60.4
  nBayes,     19.5, 33.1, 41.7, 47.7, 62.1
  which4,      0.0, 25.8, 42.7, 49.8, 60.6
  j48,         0.0, 10.0, 20.0, 24.3, 42.9
  jRip,        0.0,  7.9, 15.8, 31.2, 49.4
  which8loc,   0.0,  0.0,  0.0,  0.0, 10.5
  which4loc,   0.0,  0.0,  0.0,  0.0, 10.4
  which2loc,   0.0,  0.0,  0.0,  0.0, 25.6

key, ties, win, loss, win-loss @ 99%:
  which2,     1, 9, 0,  9
  manualDown, 1, 9, 0,  9
  manualUp,   2, 6, 2,  4
  which4,     3, 5, 2,  3
  nBayes,     3, 5, 2,  3
  which8,     2, 5, 3,  2
  jRip,       1, 3, 6, -3
  j48,        1, 3, 6, -3
  which8loc,  2, 0, 8, -8
  which4loc,  2, 0, 8, -8
  which2loc,  2, 0, 8, -8

manual down wins?

20

“areas” in PC1 (min, 25%, median, 75%, max):
  which2,      0.0,  0.0, 65.0, 71.1, 81.8
  manualUp,   52.1, 58.4, 60.6, 63.4, 71.6
  nBayes,     36.4, 46.1, 51.5, 53.4, 60.9
  manualDown, 32.3, 41.9, 44.6, 46.2, 55.3
  j48,         3.1, 12.5, 19.2, 24.6, 41.5
  jRip,        0.0, 11.0, 15.1, 23.2, 30.8
  which8,      0.0,  9.1, 22.6, 30.7, 47.7
  which8loc,   0.0,  0.0,  7.4, 12.7, 22.1
  which4loc,   0.0,  0.0,  3.8, 14.8, 30.3
  which4,      0.0,  0.0,  0.0, 50.3, 59.0
  which2loc,   0.0,  0.0,  0.0,  9.7, 26.3

key, ties, win, loss, win-loss @ 99%:
  manualUp,   1, 9, 0,  9
  which2,     2, 8, 0,  8
  nBayes,     1, 8, 1,  7
  manualDown, 1, 6, 3,  3
  which8,     3, 3, 4, -1
  jRip,       3, 3, 4, -1
  j48,        3, 3, 4, -1
  which4,     7, 0, 3, -3
  which8loc,  3, 0, 7, -7
  which4loc,  3, 0, 7, -7
  which2loc,  3, 0, 7, -7

21

Results type #2: 2/8 examples

Manual worse than (WHICH or traditional data miners)

22

“areas” in KC3_mod (min, 25%, median, 75%, max):
  which2,     73.3, 82.4, 87.3, 90.5, 95.4
  nBayes,     45.5, 59.2, 64.2, 69.6, 75.4
  manualUp,   50.7, 57.5, 64.2, 68.1, 77.4
  which4,      0.0, 40.5, 47.8, 58.6, 67.2
  manualDown, 31.3, 39.5, 47.6, 55.6, 66.8
  which8,      0.0, 36.2, 46.7, 52.7, 62.1
  j48,         0.0, 13.6, 23.1, 28.9, 42.6
  jRip,        0.0, 13.1, 17.7, 23.9, 54.2
  which8loc,   0.0,  0.0,  0.0,  0.0, 43.0
  which4loc,   0.0,  0.0,  0.0,  8.3, 19.7
  which2loc,   0.0,  0.0,  6.6, 18.9, 39.9

key, ties, win, loss, win-loss @ 99%:
  which2,     0, 10, 0, 10
  nBayes,     1,  8, 1,  7
  manualUp,   1,  8, 1,  7
  which8,     2,  5, 3,  2
  which4,     2,  5, 3,  2
  manualDown, 2,  5, 3,  2
  j48,        1,  3, 6, -3
  jRip,       2,  2, 6, -4
  which2loc,  2,  1, 7, -6
  which4loc,  2,  0, 8, -8
  which8loc,  1,  0, 9, -9

23

“areas” in PC3_mod (min, 25%, median, 75%, max):
  which2,     70.6, 76.0, 79.3, 82.7, 88.4
  nBayes,     58.8, 63.0, 67.4, 69.0, 75.4
  which4,     56.2, 62.2, 65.3, 68.3, 77.5
  manualDown, 48.9, 55.3, 57.5, 60.1, 65.2
  manualUp,   43.1, 47.7, 49.9, 52.4, 59.0
  j48,         0.0, 17.4, 22.7, 26.3, 36.5
  which8,      0.0, 13.6, 31.9, 36.7, 43.7
  jRip,        0.0,  6.3, 12.5, 19.4, 34.4
  which4loc,   0.0,  2.1,  5.6,  9.8, 16.4
  which8loc,   0.0,  0.0,  0.0,  4.1, 16.1
  which2loc,   0.0,  0.0,  1.9,  6.6, 21.5

key, ties, win, loss, win-loss @ 99%:
  which2,     0, 10, 0, 10
  which4,     1,  8, 1,  7
  nBayes,     1,  8, 1,  7
  manualDown, 0,  7, 3,  4
  manualUp,   0,  6, 4,  2
  which8,     1,  4, 5, -1
  j48,        1,  4, 5, -1
  jRip,       0,  3, 7, -4
  which4loc,  1,  1, 8, -7
  which2loc,  2,  0, 8, -8
  which8loc,  1,  0, 9, -9

manual down wins?

24

Once, manual beats (WHICH or traditional data miners)

25

“areas” in MC2_mod (min, 25%, median, 75%, max):
  manualUp,   63.3, 70.9, 74.3, 78.3, 80.4
  nBayes,     21.4, 46.6, 55.9, 59.1, 79.1
  manualDown, 29.7, 38.1, 42.8, 47.2, 57.4
  j48,        21.9, 29.3, 43.7, 55.4, 69.7
  jRip,       12.7, 17.0, 28.5, 35.2, 56.4
  which8,      0.0, 11.2, 21.9, 27.4, 42.4
  which8loc,   0.0,  0.0,  0.0,  0.0, 29.8
  which4loc,   0.0,  0.0,  0.0,  5.6, 14.9
  which4,      0.0,  0.0,  5.6, 25.3, 47.9
  which2loc,   0.0,  0.0,  0.0,  0.0, 21.0
  which2,      0.0,  0.0,  0.0, 40.8, 99.7

key, ties, win, loss, win-loss @ 99%:
  manualUp,   0, 10, 0, 10
  nBayes,     0,  9, 1,  8
  manualDown, 1,  7, 2,  5
  j48,        1,  7, 2,  5
  jRip,       1,  5, 4,  1
  which8,     3,  3, 4, -1
  which4,     4,  1, 5, -4
  which2,     5,  0, 5, -5
  which4loc,  4,  0, 6, -6
  which2loc,  4,  0, 6, -6
  which8loc,  3,  0, 7, -7

26

Overall

WHICH2 > manual > traditional

27

“areas” across all data sets (min, 25%, median, 75%, max):
  which2,      0.0, 66.8, 77.6, 85.6, 99.7
  manualUp,   37.1, 56.5, 63.7, 70.2, 80.4
  nBayes,     19.5, 52.9, 61.2, 69.6, 82.4
  manualDown, 29.7, 42.3, 46.4, 53.4, 71.8
  which4,      0.0, 35.6, 53.7, 63.9, 96.7
  which8,      0.0, 18.6, 35.5, 47.0, 92.5
  j48,         0.0, 18.3, 27.9, 42.9, 72.0
  jRip,        0.0, 13.3, 23.9, 39.7, 65.2
  which8loc,   0.0,  0.0,  0.0,  6.7, 92.5
  which4loc,   0.0,  0.0,  0.0,  9.8, 96.7
  which2loc,   0.0,  0.0,  0.0, 11.2, 97.0

key, ties, win, loss, win-loss @ 99%:
  which2,     0, 10, 0, 10
  nBayes,     1,  8, 1,  7
  manualUp,   1,  8, 1,  7
  which4,     0,  7, 3,  4
  manualDown, 0,  6, 4,  2
  which8,     1,  4, 5, -1
  j48,        1,  4, 5, -1
  jRip,       0,  3, 7, -4
  which8loc,  2,  0, 8, -8
  which4loc,  2,  0, 8, -8
  which2loc,  2,  0, 8, -8

28

Conclusions

29

Overall

• Don’t assess learners without a usage context.
  – Here: context = “read less, find more”
• Some support for the Koru hypothesis.
• Value of manual (up or down) is questionable:
  – only outstandingly better in one data set,
  – and worse than other methods in 4/10 data sets.
• WHICH2:
  – the general winner,
  – near optimum: min 0%, lower quartile 67%, median 78%, 3rd quartile 86%, max 99%.

Still room for improvement.

30

Early stopping rules (useful, a little interesting)

[Figure: % defective modules detected vs. % LOC read for the optimal curve and detector1.]

Watch inspection rules to learn when enough is enough.

31

Learning the actual number of defects (very useful, very interesting)

[Figure: % defective modules detected vs. % LOC read; curve1 = optimal (real defects), curve2 = inspections.]

Q: Can we learn curve1 from watching the growth of curve2?

A: Maybe. WHICH2’s (50%, 75%) percentiles = (79%, 86%), i.e. its inspection curve (curve2) gets pretty close to curve1.