Transcript of Cukic Promise08 V3
Comparing Design and Code Metrics for Software Quality Prediction
Y. Jiang, B. Cukic, T. Menzies
Lane Department of CSEE
West Virginia University
PROMISE 2008
Predicting Faults Earlier Matters
• Boehm observed that fault removal is 50 to 200 times less costly when performed in the design phase than after deployment.
• NASA research shows that a fault introduced in the requirements, which leaks into the design, code, test, integration, and operational phases, incurs correction cost factors of 5, 10, 50, 130, and 368, respectively (see the sketch below).
• Therefore, the earlier we can identify fault-prone artifacts, the better.
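To make the cost multipliers concrete, here is a minimal Python sketch of the arithmetic; the phase names and the helper function are illustrative, not from the paper:

```python
# Sketch of the NASA cost-multiplier arithmetic cited above: a fault
# introduced in the requirements phase costs 1 unit to fix there, and
# the factors below apply if it leaks into a later phase. The helper
# function is illustrative, not from the paper.
COST_FACTORS = {"requirements": 1, "design": 5, "code": 10,
                "test": 50, "integration": 130, "operation": 368}

def savings_factor(caught_in: str, instead_of: str = "operation") -> float:
    """How many times cheaper it is to fix the fault in `caught_in`."""
    return COST_FACTORS[instead_of] / COST_FACTORS[caught_in]

print(savings_factor("design"))  # 73.6: catching the fault at design
                                 # time is ~74x cheaper than in operation
```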
How early?
• Do requirements metrics correlate with fault-proneness? [Jiang et al., ISSRE 07]
Predicting From Design Metrics?
• It has been successfully demonstrated.
• Ohlsson and Alberg ('96) demonstrated that design metrics predict fault-prone modules effectively.
– "Design metrics are better predictors than code size (LOC)"
– Telephone switching domain
• Basili validated the so-called CK object-oriented (design) metrics using eight student-developed systems.
• Nagappan, Ball & Zeller confirmed Ohlsson's findings using OO design metrics on five Microsoft systems.
Goal of This Study
• A thorough comparison of fault prediction models that utilize:
1. Design metrics
2. Static code metrics
3. Combination of both
• A large enough number of projects, and of modules within each project, to support statistical comparison.
Metrics Description (1)
• Code metrics
Metrics Description (2)
• Design metrics
Experimental Design
[Flow diagram: Code / Design / All metrics → Classification (10×10 CV) → Illustrate results using ROC → Evaluate results using AUC (trapezoid rule) → Visualize using boxplot diagrams → Compare using nonparametric statistical tests]
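As an illustration of the evaluation step in this pipeline, here is a minimal sketch of AUC computed with the trapezoid rule over an ROC curve; it assumes NumPy and scikit-learn and is not the study's actual implementation:

```python
# Sketch: AUC via the trapezoid rule over an ROC curve.
# Assumes NumPy / scikit-learn; illustrative, not the study's code.
import numpy as np
from sklearn.metrics import roc_curve

def auc_trapezoid(y_true, y_score):
    """Integrate TPR over FPR with the trapezoid rule."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return np.trapz(tpr, fpr)

# Example: y_true are 0/1 fault labels, y_score predicted probabilities.
print(auc_trapezoid([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```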
Datasets: NASA MDP
• Used every dataset that offered both design- and code-level metrics.
Experimental Design (2)
• 5 classification algorithms
– Random forest, bagging, boosting, logistic regression, naive Bayes
• 10×10-way cross-validation (sketched in code below):
– one 10-way experiment generates an ROC curve ⇒ 10 ROCs ⇒ 10 AUCs
• We analyzed 1,950 experiments!
– 13 [data sets] × 3 [metrics sets] × 5 [classifiers] × 10 [CV repetitions]
• We show only the best model from each metrics set in each data set (project).
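A minimal scikit-learn sketch of this protocol follows; the classifier settings are illustrative defaults, not the study's actual configuration:

```python
# Sketch of the 10x10 cross-validation protocol: for each repetition,
# 10-fold CV pools out-of-fold scores into one ROC curve, hence one AUC;
# ten repetitions give ten AUCs per classifier. Illustrative defaults.
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

CLASSIFIERS = {
    "random forest": RandomForestClassifier(),
    "bagging": BaggingClassifier(),
    "boosting": AdaBoostClassifier(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
}

def ten_by_ten_aucs(clf, X, y):
    """Return the 10 AUCs from ten repetitions of 10-fold CV."""
    aucs = []
    for rep in range(10):
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=rep)
        scores = np.zeros(len(y), dtype=float)
        for train, test in cv.split(X, y):
            clf.fit(X[train], y[train])
            scores[test] = clf.predict_proba(X[test])[:, 1]
        aucs.append(roc_auc_score(y, scores))
    return aucs
```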
Analysis example: PC5 data set
[ROC curves and boxplot diagrams for PC5]
• Mean AUC:
– All: 0.979
– Code: 0.967
– Design: 0.956
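The boxplot view can be reproduced with a few lines of matplotlib; `aucs_all`, `aucs_code`, and `aucs_design` are assumed to be the ten AUCs per metrics set from a 10×10 CV run such as `ten_by_ten_aucs` above (names are illustrative):

```python
# Sketch of the per-dataset boxplot comparison (e.g., PC5): ten AUCs
# per metrics set, side by side. Illustrative, not the study's code.
import matplotlib.pyplot as plt

def plot_auc_boxplots(aucs_all, aucs_code, aucs_design, title="PC5"):
    plt.boxplot([aucs_all, aucs_code, aucs_design],
                labels=["All", "Code", "Design"])
    plt.ylabel("AUC")
    plt.title(title)
    plt.show()
```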
Typical Results
Not So Typical Results
Atypical Results
Test Statistical Significance
• Use the procedure recommended by Demsar for each of the 13 data sets (sketched in code below).
1. The Friedman test checks whether performance differs amongst the design, code, and all experiments.
• If no, no further test is necessary.
• If yes:
2. Use pairwise nonparametric tests (typically the Wilcoxon test or the Mann-Whitney test) to determine which group of metrics is best.
• A 95% confidence level is used in all experiments.
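A minimal SciPy sketch of this procedure for one data set, assuming ten AUCs per metrics set; the names are illustrative, not the study's code:

```python
# Sketch of the Demsar-style procedure: Friedman test first, pairwise
# nonparametric tests only if it rejects. Illustrative names.
from scipy.stats import friedmanchisquare, mannwhitneyu, wilcoxon

def compare_metric_sets(aucs_all, aucs_code, aucs_design, alpha=0.05):
    # Step 1: Friedman test across the three metrics sets.
    _, p = friedmanchisquare(aucs_all, aucs_code, aucs_design)
    if p >= alpha:
        return "no significant difference; stop"
    # Step 2: pairwise tests, run only when Friedman rejects.
    pairs = {"all vs code": (aucs_all, aucs_code),
             "all vs design": (aucs_all, aucs_design),
             "code vs design": (aucs_code, aucs_design)}
    return {name: {"wilcoxon p": wilcoxon(a, b).pvalue,
                   "mann-whitney p": mannwhitneyu(a, b).pvalue}
            for name, (a, b) in pairs.items()}
```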
Pairwise comparison
• Test the following hypotheses for a pairwise comparison of two experiments, A and B:
H0: There is no difference in the performance of the models built from group A metrics and group B metrics;
H1: The performance of the group A metrics is better than that of the group B metrics;
H2: The performance of the group A metrics is worse than that of the group B metrics.
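A sketch of this three-way decision with one-sided Mann-Whitney tests at the study's 95% confidence level (the Wilcoxon signed-rank test could be substituted for paired samples); illustrative, not the study's code:

```python
# Sketch of the H0/H1/H2 decision via one-sided Mann-Whitney tests.
from scipy.stats import mannwhitneyu

def pairwise_decision(aucs_a, aucs_b, alpha=0.05):
    if mannwhitneyu(aucs_a, aucs_b, alternative="greater").pvalue < alpha:
        return "H1: group A metrics perform better"
    if mannwhitneyu(aucs_a, aucs_b, alternative="less").pvalue < alpha:
        return "H2: group A metrics perform worse"
    return "H0: no difference detected"
```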
The Result of Hypothesis Test (1)
• Friedman's test
– Average p-value = 0.00003604 (< 0.05)
– Strongly suggests a statistically significant difference amongst the models from all, code, and design metrics over all 13 datasets.
• The two pairwise nonparametric tests (the Wilcoxon test and the Mann-Whitney test) agree in all cases but one:
– PC2: the Mann-Whitney test has all > code, but the Wilcoxon test has all = code.
– This discrepancy does not affect our overall trend.
Findings
• Statistical significance tests utilized AUC for model comparison.
• In 7 datasets, all = code; in 6 datasets, all > code.
• In all 13 datasets, all > design.
• In 12 datasets, code > design.
• The only exception is the KC4 project, where design > code.
Summary of Observations
• The performance of models is influenced
– MORE by metrics
– THAN by classification algorithms.
• The combination of design AND code metrics provides better models than code or design metrics alone.
• Models built from code metrics generally perform better than those built from design metrics only.
• Design metrics are useful for predicting fault-prone modules earlier.
• A clear indication that integrating metrics from different phases of development is useful.
Threats to Validity
• Noise in the metrics data sets.
– Would feature selection change some outcomes?
• Generality of NASA datasets.
• Design metrics were reengineered from code.
– They reflect the code base more accurately than metrics computed from design documentation would.
• The All metrics data set contains a few independent variables that fall into neither the Code nor the Design group.
– This needs correction, but the results are unlikely to change.
Ensuing Research
• Software fault prediction can be improved.
– Improvement is unlikely to come from the application of more off-the-shelf data mining algorithms.
– Accounting for a project's "business context" may contribute to improvement.
• Metrics from different development stages add information not available from the code.
• Evaluation of effectiveness should be tailored to project-specific (subsystem/module-specific) risks.
– Reliable metrics collection.