defect prediction
Transcript of defect prediction
-
8/7/2019 defect prediction
1/29
Introduction to Defect Prediction
Cmpe 589
Spring 2008
-
Problem 2
How hard will it be for another organization to maintain this software?
McCabe Complexity
-
Problem Definition
Software development lifecycle: Requirements, Design, Development, Test (takes ~50% of overall time)
Detect and correct defects before delivering software.
Test strategies:
Expert judgment
Manual code reviews
Oracles/predictors as secondary tools
-
Problem Definition
-
Testing
-
Defect Prediction
2-class classification problem:
Non-defective if error = 0
Defective if error > 0
Two things needed:
Raw data: source code
Software metrics -> static code attributes
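The labeling rule above can be sketched in Python. This is only an illustration: the `Module` record and its field names are hypothetical, and the two rows reuse the attribute values that the slides give for the sample modules main() and sum().

```python
from dataclasses import dataclass


@dataclass
class Module:
    # Hypothetical record: one row of static code attributes plus the error count.
    name: str
    loc: int
    locc: int
    v: int
    cc: int
    errors: int


def label(module: Module) -> str:
    # Two-class rule from the slide: defective if at least one error was recorded.
    return "defective" if module.errors > 0 else "non-defective"


modules = [
    Module("main()", 16, 4, 5, 2, 2),
    Module("sum()", 5, 1, 3, 1, 0),
]

# Feature vectors (static code attributes) and class labels for a learner.
features = [[m.loc, m.locc, m.v, m.cc] for m in modules]
labels = [label(m) for m in modules]
```

The error count is used only to derive the label; it is not part of the feature vector.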
-
Static Code Attributes
void main()
{
    //This is a sample code
    //Declare variables
    int a, b, c;
    // Initialize variables
    a=2;
    b=5;
    //Find the sum and display c if greater than zero
    c=sum(a,b);
    if (c < 0)
        printf("%d\n", a);
    return;
}
int sum(int a, int b)
{
    // Returns the sum of two numbers
    return a+b;
}
Slide annotations on main() (apparently the intended condition and the intended value to print): c > 0, c
Module  LOC  LOCC  V  CC  Error
main()  16   4     5  2   2
sum()   5    1     3  1   0
LOC: Lines of Code
LOCC: Lines of Commented Code
V: Number of unique operands & operators
CC: Cyclomatic Complexity
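A rough sketch of how such attributes might be extracted from source text is given below. It is not the tool used for the slides: the function name `static_attributes` is hypothetical, CC is only approximated as one plus the number of decision keywords and short-circuit operators, and LOC here counts non-blank lines, so it depends on layout and need not match the table's value for main().

```python
import re

# Decision points counted by this heuristic (an assumption, not a real C parser).
DECISIONS = re.compile(r"\b(?:if|for|while|case)\b|&&|\|\|")


def static_attributes(source):
    """Return LOC, LOCC and an approximate CC for a C-like source string."""
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    # LOCC: lines that carry a // comment.
    locc = sum(1 for line in lines if "//" in line)
    # Drop comment text before looking for decision points.
    code = [line.split("//", 1)[0] for line in lines]
    cc = 1 + sum(len(DECISIONS.findall(part)) for part in code)
    return {"LOC": len(lines), "LOCC": locc, "CC": cc}


MAIN_SRC = r"""
void main()
{
    //This is a sample code
    //Declare variables
    int a, b, c;
    // Initialize variables
    a=2;
    b=5;
    //Find the sum and display c if greater than zero
    c=sum(a,b);
    if (c < 0)
        printf("%d\n", a);
    return;
}
"""

SUM_SRC = r"""
int sum(int a, int b)
{
    // Returns the sum of two numbers
    return a+b;
}
"""

main_attrs = static_attributes(MAIN_SRC)
sum_attrs = static_attributes(SUM_SRC)
```

Under these assumptions the sketch gives LOCC = 4 and CC = 2 for main(), and LOC = 5, LOCC = 1, CC = 1 for sum().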
-
Defect Prediction
Machine Learning based models.
Defect density estimation
Regression models: error proneness
First classification, then regression
Defect prediction between versions
Defect prediction for embedded systems
-
Constructing Predictors
Baseline: Naive Bayes.
Why? Best reported results so far (Menzies et al., 2007)
Remove assumptions and construct different models:
Independent attributes -> multivariate distribution
Attributes of equal importance
-
Weighted Naive Bayes
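Naive Bayes:
$$g_i(x) = -\frac{1}{2}\sum_{j=1}^{d}\left(\frac{x_j^t - m_{ij}}{s_j}\right)^2 + \log(P(C_i))$$
Weighted Naive Bayes:
$$g_i(x) = -\frac{1}{2}\sum_{j=1}^{d} w_j\left(\frac{x_j^t - m_{ij}}{s_j}\right)^2 + \log(P(C_i))$$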
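A minimal Python sketch of the two discriminants follows. For class i it scores an instance as minus one half of the sum over attributes j of w_j times the squared standardized distance between x_j and the class mean m_ij, plus the log prior log P(C_i). Plain Naive Bayes is the case where every w_j is 1. The function names, the toy data and the choice of a pooled within-class standard deviation for s_j are assumptions made for illustration only.

```python
import math


def fit(X, y):
    """Estimate class means m_ij, pooled standard deviations s_j and priors P(C_i)."""
    n, d = len(X), len(X[0])
    means, priors = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == c]
        priors[c] = len(rows) / n
        means[c] = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    # Assumption: s_j is shared by all classes (pooled within-class deviation).
    stds = []
    for j in range(d):
        ss = sum((x[j] - means[lab][j]) ** 2 for x, lab in zip(X, y))
        stds.append(math.sqrt(ss / n))
    return means, stds, priors


def discriminant(x, mean_i, stds, prior_i, weights=None):
    """g_i(x) = -1/2 * sum_j w_j * ((x_j - m_ij) / s_j)^2 + log P(C_i)."""
    if weights is None:
        weights = [1.0] * len(x)  # plain Naive Bayes: all attributes count equally
    quad = sum(w * ((xj - m) / s) ** 2
               for w, xj, m, s in zip(weights, x, mean_i, stds))
    return -0.5 * quad + math.log(prior_i)


def predict(x, model, weights=None):
    means, stds, priors = model
    return max(means, key=lambda c: discriminant(x, means[c], stds, priors[c], weights))


# Hypothetical toy data: two attributes per module, label 1 = defective.
X = [[10, 1], [12, 2], [11, 1], [40, 6], [45, 8], [50, 7]]
y = [0, 0, 0, 1, 1, 1]
model = fit(X, y)
```

Calling `predict([11, 1], model)` on this toy data returns class 0 and `predict([44, 7], model)` returns class 1. Passing `weights=[1.0, 1.0]` gives the same scores as omitting the weights.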
-
Datasets
Name  # Features  # Modules  Defect Rate (%)
CM1   38          505        9
PC1   38          1107       6
PC2   38          5589       0.6
PC3   38          1563       10
PC4   38          1458       12
KC3   38          458        9
KC4   38          125        40
MW1   38          403        9
-
Performance Measures
Defects          Actual
                 no   yes
Predicted  no    A    B
           yes   C    D
Accuracy: (A+D)/(A+B+C+D)
Pd (Hit Rate): D / (B+D)
Pf (False Alarm Rate): C / (A+C)
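The three measures can be computed directly from the four cells of the confusion matrix, as in the sketch below. The function name `performance` and the example counts are hypothetical, and the sketch assumes every denominator is non-zero.

```python
def performance(a, b, c, d):
    """Accuracy, pd and pf from confusion-matrix cells.

    a: predicted no,  actual no
    b: predicted no,  actual yes
    c: predicted yes, actual no
    d: predicted yes, actual yes
    """
    accuracy = (a + d) / (a + b + c + d)
    pd = d / (b + d)  # hit rate: share of defective modules that are found
    pf = c / (a + c)  # false alarm rate: share of clean modules that are flagged
    return accuracy, pd, pf


# Hypothetical counts for 100 modules, 10 of them defective.
accuracy, pd, pf = performance(a=80, b=5, c=10, d=5)
```

With these counts accuracy is 0.85 even though pd is only 0.5, which is why pd and pf are reported alongside accuracy when defects are rare.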
-
Results: InfoGain & GainRatio
       WNB+IG (%)     WNB+GR (%)     IG+NB (%)
Data   pd  pf  bal    pd  pf  bal    pd  pf  bal
CM1    82  39  70     82  39  70     83  32  74
PC1    69  35  67     69  35  67     40  12  57
PC2    72  15  77     66  20  72     72  15  77
PC3    80  35  71     81  35  72     60  15  70
PC4    88  27  79     87  24  81     92  29  78
KC3    80  27  76     83  30  76     48  15  62
KC4    77  35  70     78  35  71     79  33  72
MW1    70  38  66     68  34  67     44   7  60
Avg:   77  31  72     77  32  72     65  20  61
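The slides do not define bal. The sketch below assumes it is the balance measure used by Menzies et al. (2007): one minus the normalized Euclidean distance from the point (pf, pd) to the ideal point (pf = 0, pd = 1). Under that assumption the computed values agree with the per-dataset table entries checked here.

```python
import math


def balance(pd, pf):
    """Assumed definition: 1 - sqrt((0 - pf)^2 + (1 - pd)^2) / sqrt(2)."""
    return 1 - math.sqrt((0 - pf) ** 2 + (1 - pd) ** 2) / math.sqrt(2)


# (pd, pf) pairs taken from the table, as fractions.
cm1_wnb_ig = round(100 * balance(0.82, 0.39))
pc4_wnb_ig = round(100 * balance(0.88, 0.27))
cm1_ig_nb = round(100 * balance(0.83, 0.32))
```

A predictor with pd = 1 and pf = 0 has balance 1, and one with pd = 0 and pf = 1 has balance 0.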
-
Results: Weight Assignments
[Figure: GainRatio Weights. Cumulative GainRatio feature weights (y-axis, 0 to 16) against enumerated metrics (x-axis, 0 to 40), one curve each for CM1, PC1, PC2, PC3, PC4, KC1, KC3 and MW1.]
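For readers unfamiliar with the two weighting heuristics, the sketch below computes information gain and gain ratio for one discretized attribute. It uses the textbook definitions; how the metrics were discretized and how the scores were scaled into weights for these results is not stated on the slides, so the sketch stops at the raw scores. The toy attribute values are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in Counter(values).values())


def info_gain(attribute, labels):
    """Reduction in label entropy after splitting on the attribute's values."""
    n = len(labels)
    remainder = 0.0
    for value in set(attribute):
        subset = [lab for a, lab in zip(attribute, labels) if a == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def gain_ratio(attribute, labels):
    """Information gain divided by the entropy of the attribute itself."""
    split_info = entropy(attribute)
    return info_gain(attribute, labels) / split_info if split_info > 0 else 0.0


labels = [0, 0, 1, 1]
informative = ["lo", "lo", "hi", "hi"]    # separates the classes perfectly
uninformative = ["lo", "hi", "lo", "hi"]  # tells nothing about the class
```

On this toy data the informative attribute scores 1.0 on both measures and the uninformative one scores 0.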
-
Benefiting from defect data in practice
Within Company vs Cross Company Data
Investigated in cost estimation literature
No studies in defect prediction!
No conclusions in cost estimation
Straightforward interpretation of results in defect prediction.
Possible reason: well defined features.
-
How much data do we need?
Consider:
Dataset size:1000
Defect rate: 8%
Training instances: 90%
1000 * 90% = 900 training instances
900 * 8% = 72 defective training instances
900 - 72 = 828 non-defective training instances
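The arithmetic can be checked with a few lines of Python. The variable names are illustrative. The sketch counts non-defective instances within the 90% training split, assuming the defect rate is the same in the split as in the whole dataset.

```python
dataset_size = 1000
defect_rate = 0.08      # 8% of modules are defective
train_fraction = 0.90   # 90% of the data is used for training

training = round(dataset_size * train_fraction)
defective_training = round(training * defect_rate)
non_defective_training = training - defective_training
```

This gives 900 training instances, 72 of them defective and 828 non-defective, so the learner sees far fewer defective examples than non-defective ones.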
-
Intelligent data sampling
With random sampling of 100 instances we can learn as well as with thousands.
Can we increase the performance with wiser sampling strategies?
Which data?
Practical aspects: Industrial case study.
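The random-sampling baseline mentioned above can be sketched as follows. The helper name `sample_training_set`, the fixed seed and the use of integer IDs as stand-ins for module records are assumptions made for illustration. The slides do not say how the sample was drawn beyond it being random.

```python
import random


def sample_training_set(instances, size=100, seed=0):
    """Draw `size` training instances uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    if len(instances) <= size:
        return list(instances)
    return rng.sample(instances, size)


# Hypothetical pool of 1000 module IDs standing in for labeled modules.
pool = list(range(1000))
training_sample = sample_training_set(pool)
```

A wiser strategy would replace the uniform draw with a rule for choosing which instances to keep.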
-
ICSOFT07
WC vs CC Data? When to use WC or CC?
How much data do we need to construct a model?
-
ICSOFT07
-
Module Structure vs Defect Rate
Fan-in, fan-out
PageRank algorithm
Call graph information on the code
"Small is beautiful"
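To make the call-graph measures concrete, the sketch below computes fan-in, fan-out and a basic PageRank score for each function in a small call graph. The graph is hypothetical, modeled on the sample program where main() calls sum() and printf(). The damping factor of 0.85 and the uniform redistribution of rank from functions that call nothing are conventional choices, not details taken from the slides.

```python
def all_nodes(call_graph):
    callees = {c for targets in call_graph.values() for c in targets}
    return sorted(set(call_graph) | callees)


def fan_in_out(call_graph):
    """Fan-in: number of distinct callers. Fan-out: number of distinct callees."""
    nodes = all_nodes(call_graph)
    fan_out = {n: len(set(call_graph.get(n, ()))) for n in nodes}
    fan_in = {n: 0 for n in nodes}
    for caller, callees in call_graph.items():
        for callee in set(callees):
            fan_in[callee] += 1
    return fan_in, fan_out


def pagerank(call_graph, damping=0.85, iterations=50):
    """Power iteration; rank flows from a caller to the functions it calls."""
    nodes = all_nodes(call_graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by functions with no callees is spread over all nodes.
        dangling = sum(rank[v] for v in nodes if not call_graph.get(v))
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for caller in nodes:
            callees = set(call_graph.get(caller, ()))
            for callee in callees:
                new_rank[callee] += damping * rank[caller] / len(callees)
        rank = new_rank
    return rank


# Hypothetical call graph: caller -> list of callees.
call_graph = {"main": ["sum", "printf"], "sum": [], "printf": []}
fan_in, fan_out = fan_in_out(call_graph)
ranks = pagerank(call_graph)
```

In this graph main has fan-out 2 and fan-in 0, while sum and printf, which are called but call nothing, receive the higher PageRank scores.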
-
Performance vs. Granularity
[Figure: Performance (y-axis, 0 to 120) against Granularity (x-axis: Statement, Method, Class, File, Component, Project).]