Predicting Number of Software Defects and Defect Remediation Time to Minimize
Leakage and Allocate Rework Efforts in an Upcoming Software Release
by Christine Adrian Sigalla
B.S. (triple major) in Management Information Systems, Accounting, and Business Management, May 2014, La Roche University
M.S. in Systems Engineering, May 2018, The George Washington University
A Praxis submitted to
The Faculty of
The School of Engineering and Applied Science of The George Washington University
in partial fulfillment of the requirements for the degree of Doctor of Engineering
August 31, 2020
Praxis directed by
Amir Etemadi
Associate Professor of Engineering and Applied Science
Oluwatomi Adetunji
Professorial Lecturer of Engineering Management and Systems Engineering
The School of Engineering and Applied Science of The George Washington University
certifies that Christine Adrian Sigalla has passed the Final Examination for the degree of
Doctor of Engineering as of July 24, 2020. This is the final and approved form of the
Praxis.
Predicting Number of Software Defects and Defect Remediation Time to Minimize
Leakage and Allocate Rework Efforts in an Upcoming Software Release
Christine Adrian Sigalla
Praxis Research Committee:
Amir Etemadi, Associate Professor of Engineering and Applied Science, Praxis Co-Director
Oluwatomi Adetunji, Professorial Lecturer of Engineering Management and Systems Engineering, Praxis Co-Director
Thomas Holzer, Professorial Lecturer of Engineering Management and Systems Engineering, Committee Member
© Copyright 2020 by Christine Adrian Sigalla. All rights reserved.
Dedication
I dedicate this research to my husband, parents, sisters, extended family, in-laws,
professors, managing director, manager, friends, colleagues, and employer for supporting
me and being patient with me during this demanding and unbalanced 2-year journey.
My husband deserves an honorable dedication for being my biggest motivator and
supporter behind all of my accomplishments. My husband stepped up and took care of all
house maintenance and domestic errands while I concentrated on my studies.
I would like to thank my mother and sisters, who never stopped praying for me
even when I felt like giving up at the beginning of the research phase. My father inspired
me to pursue this doctoral degree and challenged me to overcome all obstacles and
become the first in the clan ever to receive a doctorate with honors.
An honorable dedication goes to my late mother-in-law, who passed away during
my final year of doctoral research. I am so thankful that I had a chance to spend time with
her during her final days, and she was very supportive of my education and career
journey. A special dedication goes to my advisors for being patient, understanding, and
offering valuable advice that directed me to the right path during the challenging 2-year
journey.
Acknowledgements
I wish to thank my family and friends for making this 2-year journey possible. I thank my
advisors, Dr. Etemadi and Dr. Adetunji, for providing valuable advice in completing the
praxis. I also acknowledge my employer for providing partial tuition reimbursement
towards my education.
Abstract of Praxis
Predicting Number of Software Defects and Defect Remediation Time to Minimize
Leakage and Allocate Rework Efforts in an Upcoming Software Release
Information Technology companies spend substantial resources fixing the damage
caused by software defects. Defect remediation time is an important metric in allocating
rework efforts (resources) for fixing the defects. The aim of this praxis is to use statistical
learning models to predict the number of defects and defect remediation time prior to
testing. Obtaining information from these models is valuable because it gives software
engineering managers a better method by which to minimize the leakage of the defects
and allocate rework efforts in an upcoming software release.
The predictors for number of defects are: total number of components delivered,
code size (lines of code), total number of developers working on code components, total
number of requirements, and total number of test cases. The predictors for defect
remediation time are: total number of test cases, total number of requirements, number of
defects, and code size. Previous studies have used these predictors individually in
predicting the number of defects and defect remediation time. However, none of the
previous studies have considered combining all the predictors in their predictions.
This praxis addresses a gap in previous software industry research by predicting
the number of defects and the defect remediation time using 202 mainframe-language
software projects from a 4-year dataset containing 1,143 defects, and by applying the
combined influence of all the predictors. The proposed statistical learning models used in
this praxis are negative binomial regression, multiple linear regression, random forest,
and support vector machine. If the number of defects and defect remediation time can be
predicted, both software managers and researchers will benefit from this research by
applying statistical learning models to minimize defect leakage and allocate rework
efforts.
Table of Contents
Dedication
Acknowledgements
Abstract of Praxis
List of Figures
List of Tables
List of Symbols
List of Acronyms
Chapter 1—Introduction
1.1 Background
1.2 Research Motivation
1.3 Problem Statement
1.4 Thesis Statement
1.5 Research Objectives
1.6 Research Questions and Hypotheses
1.7 Scope of Research
1.8 Research Limitations
1.9 Organization of Praxis
Chapter 2—Literature Review
2.1 Introduction
2.2 Software Defects Prediction Metrics
2.2.1 Total Number of Developers (TNOD)
2.2.2 Number of Components Delivered (NOCD)
2.2.3 Total Number of Requirements (TR)
2.2.4 Total Number of Test Cases (TTC)
2.2.5 Code Size - Lines of Code (LOC)
2.2.6 Summary of Metrics for Software Defect Prediction
2.3 Root Causes of Software Rework
2.3.1 Introduction: Software Rework
2.3.2 Root Causes Analysis of Software Rework
2.3.3 Possible Ways of Reducing Avoidable Software Rework
2.4 Defect Remediation Time Prediction Metrics
2.4.1 Code Size - Lines of Code (LOC)
2.4.2 Number of Defects (NOD)
2.4.3 Total Number of Test Cases (TTC)
2.4.4 Total Number of Requirements (TR)
2.4.5 Summary of Metrics for Defect Remediation Time Prediction
2.5 Summary and Conclusion
Chapter 3—Methodology
3.1 Introduction
3.1.1 Data
3.1.2 Data Description
3.1.3 Proposed Approaches
3.2 Regression Techniques
3.2.1 Negative Binomial Regression Model
3.2.2 Multiple Linear Regression Model
3.3 Classification Techniques
3.3.1 Random Forest
3.3.2 Support Vector Machine Model
Chapter 4—Results
4.1 Analysis of Significant Predictors for Software Defects
4.1.1 Defects Data Collection and Cleaning
4.1.2 Negative Binomial Regression Summary & Partial Dependency Plots for Defects Significant Predictors
4.2 Analysis of Software Defect Prediction Model
4.2.1 Data Partition for Defects Prediction
4.2.2 Variable Importance Plot Using Defects Data
4.2.3 Development of Software Defects Prediction
4.2.4 Measures of Model Accuracy for Software Defects Prediction
4.2.5 Results of Software Defects Prediction Model
4.3 Analysis of Significant Predictors for Software Defect Remediation Time
4.3.1 Data Collection and Cleaning for Defect Remediation Time
4.3.2 Multiple Linear Regression (MLR) Model Summary
4.3.3 Partial Dependency Plots for Significant Predictor(s) of Defect Remediation Time
4.4 Analysis of Defect Remediation Time Prediction Model
4.4.1 Data Partition for Defect Remediation Time Prediction
4.4.2 Variable Importance Plot Using Defect Remediation Time Data
4.4.3 Development of Software Defect Remediation Time Prediction
4.4.4 Model Accuracy Measures for Defect Remediation Time
4.4.5 Results of Defect Remediation Time Prediction Model
Chapter 5—Discussion and Conclusions
5.1 Discussion and Conclusions
5.2 Contributions to Body of Knowledge
5.3 Recommendations for Future Research
References
Appendix A—Dataset for Defects and Defect Remediation Time
Appendix B—Metrics for Defects and Defect Remediation Time
Appendix C—Models Development & Results
Appendix D—Measures of Model Performance
List of Figures
Figure 3-1. High-level Overview of Building a Model.
Figure 3-2. Process Flow to Identify Significant Predictors for NOD and DRT.
Figure 3-3. Process Flow for SDP and DRT Prediction.
Figure 3-4. Data Subset: Random Selection of Data.
Figure 3-5. Independent Variables Set: Random Selection of Variables.
Figure 3-6. RF Classification Process.
Figure 3-7. 2-Dimensional Hyperplane and 3-Dimensional Hyperplane.
Figure 4-1. NBR Model Summary Result.
Figure 4-2. Relationship between Number of Defects and LOC.
Figure 4-3. Relationship between Number of Defects and NOCD.
Figure 4-4. Relationship between Number of Defects and TTC.
Figure 4-5. Variable Importance Plot for Software Defects.
Figure 4-6. Random Forest Model Result for Defects Prediction.
Figure 4-7. Support Vector Machine Model Result for Defects Prediction.
Figure 4-8. Number of Predicted Defects vs. Actual Defects.
Figure 4-9. Actual Versus Predicted Defects Graph.
Figure 4-10. Multiple Linear Regression Model Result.
Figure 4-11. Relationship between Defect Remediation Time and NOD.
Figure 4-12. Variable Importance Plot for Defect Remediation Time.
Figure 4-13. Random Forest Model Result for DRT Prediction.
Figure 4-14. Support Vector Machine Model Result for DRT Prediction.
Figure 4-15. Predicted Defect Remediation Time vs. Actual Defect Remediation Time.
Figure 4-16. Actual vs. Predicted Defect Remediation Time Graph.
Figure A-1. Defects.
Figure A-2. Defect Remediation Time.
Figure C-1. NBR Model Summary Result.
Figure C-2. Number of Predicted Defects vs. Actual Defects.
Figure C-3. Multiple Linear Regression Model Result.
Figure C-4. Predicted Defect Remediation Time vs. Actual Defect Remediation Time.
List of Tables
Table 2-1. Summary of Predictors for Software Defect Prediction
Table 2-2. Summary of Predictors for Defect Remediation Time Prediction
Table 3-1. Metrics Definitions and Abbreviations
Table 4-1. Data Partition for Software Defects Prediction
Table 4-2. Measure of Errors for Software Defect Prediction
Table 4-3. Data Partition for Defect Remediation Time Prediction
Table 4-4. Measure of Errors for Defect Remediation Time Prediction
Table 4-5. Summary Table
Table B-1. Metrics Definitions and Abbreviations
Table B-2. Summary of Predictors for Software Defect Prediction
Table B-3. Summary of Predictors for Defect Remediation Time Prediction
Table D-1. Measure of Errors for Software Defects Prediction
Table D-2. Measure of Errors for Defect Remediation Time Prediction
List of Symbols
x Predictor / Independent Variable
y Response / Dependent Variable
∈ Set membership ("belongs to" / "is in the set of")
K Value of Response Variable
E Error
Pr Probability
Γ Gamma Function
Ix Variable Importance Score
Vi Vector
λ Variance of Y
r Dispersion parameter
n Dimensional Space
errorOOBn Out-Of-Bag Error
List of Acronyms
LOC Lines of Code
RF Random Forest
SVM Support Vector Machine
NBR Negative Binomial Regression
MLR Multiple Linear Regression
IT Information Technology
NOD Number of Defects
DRT Defect Remediation Time
IEEE Institute of Electrical and Electronics Engineers
COBOL Common Business-Oriented Language
JCL Job Control Language
CPY Copybook
TTC Total Number of Test Cases
TR Total Number of Requirements
NOCD Number of Components Delivered
TNOD Total Number of Developers
SCM Software Configuration Management
PD Partial Dependency
PMI Project Management Institute
RL Release
UAT User Acceptance Testing
Chapter 1—Introduction
1.1 Background
A software defect, commonly referred to as a “bug,” is an error in the software
source code that makes the software product function in unintended ways, yielding
unexpected results. With the continuous expansion of new technology, software defects
have become a major concern in the software industry. Defect-free software has become
difficult to achieve because of the complexity of software source code, which can lead to
software failure. The failure to capture defects during the development phase of software
engineering can result in system downtime, rework efforts, and overhead in production
(Harekal & Suma, 2015).
Information Technology (IT) companies constantly spend substantial resources
finding and fixing the damage caused by software defects in order to deliver high-quality
software products to their customers and attain customer satisfaction. Finding and fixing
the defects after the software product has been delivered
to stakeholders is expensive; therefore, there is a need for predicting software defects
prior to testing (Felix & Lee, 2017; Harekal & Suma, 2015). In software engineering, the
failure to capture defects “during pre production time of the software certainly leads
towards defect leakage” (Harekal & Suma, 2015, p.20).
Software defect prediction refers to a method of predicting defective modules
(source code components) by analyzing past historical data and building statistical
learning classifiers to improve software reliability. Software reliability is defined as the
likelihood of a software product being free from defects. Identifying defective modules
early in the development of software will improve the software quality and help software
engineers optimize resource allocation for fixing defects while supporting the
development, testing, and maintenance of the software (Fan et al., 2019; Li et al., 2018).
Software engineering managers face the challenge of predicting rework
efforts during the software development planning process. Rework effort is defined as the
“effort [resources] required to fix the software defects identified during system testing”
(Bhardwaj & Rana, 2015, p.1). The role of the manager is to ensure the software product
satisfies the client’s needs, is delivered on time, and on budget. In order to determine the
rework efforts, it is important to predict the number of defects first, followed by the
defect remediation effort (time).
Defect fixing (remediation) effort is the “effort [time] required in person-hours to
fix a defect” (Goel & Singh, 2011, p. 124). In this praxis, defect remediation time is
expressed in hours rather than person-hours. The purpose of predicting defect
remediation time prior to testing is to assist software engineering managers in planning
testing efforts, prioritizing work, and allocating appropriate resources to fix the defects in
a situation where there is a high volume of defects (Akbarinasaji et al., 2018; Harekal &
Suma, 2015).
This praxis aims to predict the number of defects and the defect remediation time
in order to minimize defect leakage and allocate resources to fix the defects prior to
testing, using statistical learning models. The defect remediation time prediction can also
be used to improve defect correction time allocation in the project schedule; however,
the project schedule is not included in the dataset for this research. The statistical
learning models discussed in this praxis are negative binomial regression, multiple
linear regression, random forest, and support vector machine. To predict the number of
defects, the following predictors are used as inputs to the statistical learning models:
total number of components delivered, code size (lines of code), total number of
developers working on code components, total number of requirements, and total number
of test cases (Dhiauddin et al., 2012; Di Nucci et al., 2018; Kumar & Malik, 2019; Umar,
2013). To predict defect remediation time, the following predictors are used as inputs to
the statistical learning model: total number of test cases, total number of requirements,
number of defects, and code size (Goel & Singh, 2011; Ramdoo & Huzooree, 2015).
1.2 Research Motivation
Recent studies (Dhiauddin et al., 2012; Di Nucci et al., 2018; Kumar & Malik,
2019; Umar, 2013; Goel & Singh, 2011; Ramdoo & Huzooree, 2015) have used these
predictors individually in predicting the number of defects and defect remediation time
using object-oriented programming software projects written in Java, Python, JavaScript,
PHP, Ruby, and Scala. However, none of the studies have considered combining all the
predictors or using procedural programming software projects written in a mainframe
language (Common Business-Oriented Language [COBOL]) in their predictions.
The combined influence of all the predictors in predicting software defects and
defect remediation time using COBOL projects is the missing piece in the technical
literature and the software industry in general. This research predicts the number of
defects first, followed by the defect remediation time, and the results suggest that the
identified predictors can be used to predict both the number of defects and the defect
remediation time.
1.3 Problem Statement
Failure to capture defects during software development results in defect leakage,
causing system downtime, overhead in production, and rework efforts that can consume
up to 70% of the allocated budget for a software development project.
Software defects that are discovered during the post-production phase tend to
cause system outages, rework efforts, and overhead, which can affect the client-vendor
business relationship due to client dissatisfaction and software malfunctions. In this
situation, the client is forced to spend substantial resources fixing the damage caused by
the vendor's inability to identify defects prior to the software release. Due to tight project
schedules, it is difficult for all of the defects to be resolved; hence, some of the defects
are moved to the next release with no estimated time to fix them (Felix & Lee, 2017;
Ramdoo & Huzooree, 2015; Harekal & Suma, 2015).
1.4 Thesis Statement
Statistical learning models are required to forecast future software defects and
defect remediation time prior to testing in order to minimize the leakage of defects and
allocate rework efforts (resources) for fixing the defects in an upcoming software release.
In order to minimize the leakage of defects and allocate resources to fix defects
prior to testing, statistical learning methods are used to predict number of defects and
defect remediation time based on all of the predictors. The statistical learning methods
used are multiple linear regression, negative binomial regression, random forest, and
support vector machine.
Recent studies have built software defect and defect remediation time prediction
models based on statistical learning predictors that are useful for object-oriented
programming projects only. Therefore, it is difficult for senior management or software
engineering managers with a mainframe-related background to use these models.
1.5 Research Objectives
The aim of this praxis is to achieve the following research objectives:
• To determine the significant predictors for predicting the number of defects.
• To formulate a statistical learning model to predict the number of defects.
• To determine the significant predictors for predicting defect remediation time.
• To formulate a statistical learning model to predict defect remediation time.
The purpose of these research objectives is to propose statistical learning models that
apply the combined influence of all the predictors in predicting the number of defects
and the defect remediation time. The final objective of the predictions is to minimize the
leakage of defects and allocate resources to fix the defects prior to testing.
1.6 Research Questions and Hypotheses
The following research questions (RQ) and hypotheses (H) were used to guide the
praxis and meet the objectives of the research:
RQ1: Code size, total number of components delivered, total number of
developers working on code components, total number of requirements, and total number
of test cases are the predictors influencing the number of defects. Which predictors are
significant in predicting the number of defects?
RQ2: How can statistical learning models forecast the number of defects using
code size, total number of developers working on code components, total number of
components delivered, total number of test cases, and total number of requirements?
RQ3: Code size, number of defects, total number of requirements, and total
number of test cases are the predictors influencing the defect remediation time prediction.
Which predictors are significant in predicting defect remediation time?
RQ4: How can statistical learning models forecast the defect remediation time
using code size, number of defects, total number of requirements, and total number of test
cases?
H1: Negative binomial regression and random forest models can identify the most
important predictors for number of defects.
H2: Random forest and support vector machine models can be used to predict the
number of defects.
H3: Multiple linear regression can identify the important predictors for defect
remediation time.
H4: Random forest and support vector machine models can be used to predict
defect remediation time.
1.7 Scope of Research
This praxis relies on a dataset obtained from an IT firm. The dataset contains 4
years of defect data based on 202 COBOL projects and 1,143 total defects found in the
User Acceptance Testing (UAT) environment. The scope of this research is to predict the
number of software defects and the defect remediation time following the principles
found in the literature. The praxis performs the predictions using predictors that have
been used individually in previous studies; the novelty of the research lies in using the
combined influence of all the predictors to perform the defects and defect remediation
time predictions. The results of the predictions will provide managers with a better
method to minimize the leakage of defects into production and assist in allocating
resources to fix the defects.
1.8 Research Limitations
This praxis used IT industry data containing only 202 COBOL projects to build
and validate the models. Future studies will need to use object-oriented projects to build
the prediction models for defects and defect remediation time using the same predictors
defined for the COBOL projects, and to validate the measures of model accuracy; the
models are not restricted to COBOL projects. This praxis does not show the calculation
of the number of resources needed to fix defects. However, that number is based on the
ratio of the predicted defect remediation time (in hours) to the 40 working hours in a
work week.
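As an illustration, the following minimal R sketch converts a predicted remediation time into a resource count. The 40-hour work week and the predicted value are assumptions for illustration only, not figures from the dataset.

    # Hypothetical illustration: converting predicted remediation time (hours)
    # into a resource count, assuming one resource works 40 hours per week.
    predicted_drt_hours <- 320                  # invented model output
    resources_needed <- ceiling(predicted_drt_hours / 40)
    resources_needed                            # 8 full-time resources for one week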
1.9 Organization of Praxis
The praxis contains five chapters and is organized as follows. Chapter 1 includes
the background of the problem, solution, research objectives, research questions and
hypotheses, scope of the research, research limitations, and organization of the praxis.
Chapter 2 presents the literature review of the software defects and defect remediation
time predictions using metrics. It also explains the root causes of software rework, which
are used as additional predictors for predicting defect remediation time. Chapter 3
explains the methodology techniques for identifying significant predictors for defects and
defect remediation time. In addition, it explains the flow and the criteria for evaluating
the best model for predicting defects and defect remediation time. Chapter 4 provides the
results of the models based on the stated hypotheses. Chapter 5 provides the summary of
the research, conclusions, and recommendations for future researchers looking to extend
or critique the results.
Chapter 2—Literature Review
2.1 Introduction
Software defect prediction, which aims to predict defective modules prior to
testing, is a significant and popular research field in software engineering. The statistical
learning defect prediction model is built based on meaningful metrics and historical data
collected from past releases of procedural programming software projects written in
mainframe languages.
In the past few decades, most studies have been conducted on defect prediction
using open source, object-oriented software projects. The purpose has always been to
minimize the software costs and improve the quality of software by identifying defective
modules prior to testing. The majority of defect prediction models are tested using open
source software projects (Dhiauddin et al., 2012; Di Nucci et al., 2018; Kumar & Malik,
2019; Umar, 2013); therefore, it is challenging to test those models on closed-source
projects due to privacy-preservation issues arising from proprietary and commercial
restrictions.
While many previous studies have focused on predicting software defects for the
purpose of providing cost effective support in the development of a software product
(Felix & Lee, 2017), this research focuses on predicting the number of defects prior to
testing in order to minimize the leakage of defects in an upcoming software release using
COBOL software projects.
In this research, the software metrics or predictors for software defect prediction
are total number of components delivered, code size (lines of code), total number of
developers working on code components, total number of requirements, and total number
of test cases (Bell et al., 2013; Bird et al., 2011; Dhiauddin et al., 2012; Di Nucci et al.,
2018; Kumar & Malik, 2019; Ostrand et al., 2010; Posnett et al., 2013; Rahman &
Devanbu, 2011; Eyolfson et al., 2011; Umar, 2013).
The predictors for software defects are used as inputs to construct a statistical
learning model based on historical data. The prediction of defect remediation time plays
an important role when faced with a high volume of defects prior to a software release: it
gives managers a valuable metric for properly allocating resources to fix defects found
during software development, prior to testing.
The software engineering industry has experienced complex and difficult times in
delivering software products on time, on budget, and with good quality, due to project
risks and improper scheduling of resources to fix defects (Goel & Singh, 2011). It is
challenging to predict the time it takes to fix defects, since some defects take more time
to fix than others. For this reason, the number of defects is predicted first, followed by
defect remediation time. The following metrics are applied to predict the
time it takes to fix defects: total number of test cases, total number of requirements,
number of defects, and code size (Goel & Singh, 2011; Ramdoo & Huzooree, 2015).
This chapter firstly introduces the software metrics for predicting software
defects. Secondly, it provides information about identifying the root causes of software
reworks. Finally, it describes the software metrics that will be used for predicting defect
remediation time based on the recent studies conducted by previous researchers. The
overall purpose of this chapter is to explain individual predictors or metrics for software
defects and defect remediation time, which have been used by previous researchers, and
to identify a literature gap.
2.2 Software Defects Prediction Metrics
In this section, the predictors for software defects prediction are discussed in
detail based on the studies performed by previous researchers. The predictors for
software defect prediction are total number of developers working on code components,
number of components delivered, total number of requirements, total number of test
cases, and code size (lines of code).
2.2.1 Total Number of Developers (TNOD)
Previous studies (Bell et al., 2013; Bird et al., 2011; Di Nucci et al., 2018;
Ostrand et al., 2010; Posnett et al., 2013; Rahman & Devanbu, 2011; Eyolfson et al.,
2011) have demonstrated the role of developers in the introduction of defects. Posnett et
al. (2013) observed that developers who focus their attention on only one part of an
application are likely to introduce fewer defects compared to those who are unfocused.
Unfocused developers spread their attention across multiple parts of an application. This
indicates that a developer performing all of their tasks on a single code component tends
to have a higher level of focus on that component, which makes them less likely to
introduce defects. Hence, software modules changed by focused developers tend to
contain fewer defects than modules modified by unfocused developers.
Di Nucci et al. (2018) applied the Posnett et al. (2013) observation by determining
the focus level of developers working on code components and scattered measures.
Scattered measures refer to the "frequency of changes made by developers over the
different system's modules, but also considers the 'distance' between the modified
modules" (Di Nucci et al., 2018, p. 8). Di Nucci et al. (2018) observed that high levels of
scattered measures tend to be associated with more defects, while low levels of scattered
measures are associated with fewer defects. Therefore, high levels of scattered measures
are associated with unfocused developers, while low levels of scattered measures are
associated with focused developers.
Bell et al. (2013) and Ostrand et al. (2010) investigated whether the files that
contain defects remediated by a specific developer in a current release can assist in
predicting the defects in a file remediated by the same developer in the next release and
improve the accuracy of the standard negative binomial regression model. The purpose of
those studies was to determine if a file modified by a particular developer in a current
release is more or less likely to have defects in the future release than a file modified by a
developer at random.
Bell et al. (2013) and Ostrand et al. (2010) found that knowing the specific
developer who worked on a file is not likely to improve the prediction of defects in that
file, but knowing the cumulative number of developers who modified a file can be a
significant variable in predicting defects. This is because an individual developer whose
files have more defects is not necessarily underperforming; instead, this can imply that
the best developers tend to work on complex files that are very difficult to execute.
Eyolfson et al. (2011) contended that developers who have more experience are
less likely to introduce system defects as compared to less experienced developers.
Rahman and Devanbu (2011) examined the effect of developers’ experience and
ownership on the module. Rahman and Devanbu (2011) critiqued the observations by
Eyolfson et al. (2011) and showed that there is no link between developer experience and
the introduction of defects.
Bird et al. (2011) examined the relationship between developers’ ownership of
software components and the quality of software. Bird et al. (2011) found that a
developer who makes 80% of code changes on the module is considered to have a high
level of expertise on the module and the module is considered to have a high ownership
level. If many developers make changes on the module, then the developers were
considered to have low expertise on the module, and the module to have a low ownership
level. Bird et al.'s (2011) research indicated that a high level of component ownership
leads to fewer defects and a low level of component ownership leads to more defects.
The studies discussed above have concentrated on the role of
the developers working on a code component in terms of the developer’s experience,
ownership, level of focus on the component, and the possibility of introducing software
defects. This shows the variable “number of developers working on a code component”
has been used in the literature many times.
In this praxis, the number of developers working on code components is applied
as a predictor for software defects prediction. The purpose is to find out if the number of
developers working on code components is statistically significant in predicting software
defects in the case of COBOL projects. The underlying premise is that having many or
fewer developers working on multiple components can introduce higher or lower defect
levels, respectively.
2.2.2 Number of Components Delivered (NOCD)
Umar (2013) identified total number of test cases executed, test team size,
allocated development effort, test case execution effort, and total number of components
delivered as predictors for defects prediction. Umar (2013) observed that there is a strong
correlation between the number of defects and the number of components delivered in the
project. Having more components delivered to the testing phase of software development
life cycle indicates that there is a higher chance of getting defects.
The predictors used in Umar’s (2013) research are meant to improve testing
efficiency and assist developers in evaluating software quality and defect proneness. In
addition, predicting defects using the number of components delivered can help project
managers in assigning resources, budgeting, and rescheduling allocations. In this praxis,
number of components delivered is applied as a predictor for software defects.
2.2.3 Total Number of Requirements (TR)
Kumar and Malik (2019) proposed the logit regression model to develop a
software metrics quality testing prediction framework. The purpose of the framework was
to implement software quality testing for the organization. Software quality testing refers
to the testing of a system or its components to ensure that deliverables meet the
requirements and client expectations, and to exploring the system to find defects.
It is difficult to identify all of the defects that may result in significant losses;
hence, this framework is needed to minimize the program or project cost. In Kumar and
Malik's (2019) research, the evaluation of the framework is explained using 18 metrics
and a logit regression model. Total number of requirements was one of the 18 metrics applied
in Kumar and Malik's (2019) research; it is the sum of the functional and non-functional
requirements.
On a tight schedule, if the team receives too many requirements and has too few
resources with which to work on a project, then there is a higher chance of introducing
defects into the system, which may affect testing. Total number of requirements is
applied as a predictor for software defects prediction in this praxis.
2.2.4 Total Number of Test Cases (TTC)
The total number of test cases counts the input variables or conditions used to verify that
a requirement is working as expected. Dhiauddin et al. (2012) and Umar (2013) proposed
that total number of test cases, along with other predictors, forecast number of defects.
Dhiauddin et al. (2012) and Umar (2013) observed that there is a strong correlation
between the number of defects and total number of test cases: “If number of test cases are
high and critical to requirements, the chances [of] getting defects is high” (Umar, 2013, p.
742). This indicates that the number of defects is directly proportional to the number of
test cases. In this praxis, total number of test cases is applied as a predictor for software
defects prediction.
2.2.5 Code Size - Lines of Code (LOC)
According to Jing et al. (2018), defect metrics play a major role in building a
predictive analytics model that can improve the quality of software. Defect metrics are
divided into code metrics and process metrics. Code metrics measure the complexity and
size of the source code, while process metrics deal with the complexity of the software
code development process. Complexity in the source code involves the presence of
many unnecessary methods in a module, rather than reusing code to create smaller
methods that can accomplish a task with minimal lines of code and improve code
readability.
Lines of code (LOC) is a code metric used to measure the size of a source code.
Huda et al. (2017) considered LOC as the amount of executable code, without including
blank lines or comments. Jing et al. (2018) concluded that having high complexity on the
source code may result in a higher likelihood of introducing defects. Zhang (2009)
discovered that, by simply using the LOC metric, one can predict software defects.
Menzies et al. (2007) discovered that code metrics are still efficient predictors of
software defects, based on the National Aeronautics and Space Administration dataset.
Dhiauddin et al. (2012) proposed code size as a measure of software complexity to
predict software defects. Code size was expressed in terms of 1,000 lines of code
(KLOC). In this research, code size (LOC) is applied as a predictor for software defects
prediction.
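As a side note, the following minimal R sketch counts LOC under this definition for fixed-format COBOL source. It is an illustrative assumption rather than the tooling used in this praxis; it relies on the standard fixed-format convention that an asterisk or slash in column 7 marks a comment line.

    # Sketch: count executable LOC, excluding blank lines and comment lines.
    # In fixed-format COBOL, "*" or "/" in column 7 indicates a comment line.
    count_loc <- function(path) {
      lines <- readLines(path, warn = FALSE)
      non_blank <- lines[nzchar(trimws(lines))]   # drop blank lines
      sum(!(substr(non_blank, 7, 7) %in% c("*", "/")))
    }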
2.2.6 Summary of Metrics for Software Defect Prediction
Table 2-1 shows the summary and source of predictors for defect prediction.
Table 2-1. Summary of Predictors for Software Defect Prediction

Metrics Type | Predictors | Authors & Year of Research
Process Metrics | Total Number of Developers (TNOD) | Bell et al., 2013; Bird et al., 2011; Di Nucci et al., 2018; Ostrand et al., 2010; Posnett et al., 2013; Eyolfson et al., 2011; Rahman & Devanbu, 2011
Process Metrics | Number of Components Delivered (NOCD) | Umar, 2013
Process Metrics | Total Number of Requirements (TR) | Kumar & Malik, 2019
Process Metrics | Total Number of Test Cases (TTC) | Dhiauddin et al., 2012; Umar, 2013
Code Metrics | Code Size (LOC) | Dhiauddin et al., 2012; Huda et al., 2017; Jing et al., 2018; Menzies et al., 2007; Zhang, 2009
2.3 Root Causes of Software Rework
2.3.1 Introduction: Software Rework
Ramdoo and Huzooree (2015) defined rework as the additional effort of
repeating a process because either the process was implemented incorrectly or the
client changed the project requirements.
Many firms spend a substantial amount of rework effort (time) and money to
improve the quality of a product during the development of software. Rework can impact
the productivity of the firm; hence, it is important to identify avoidable rework effort at
an early stage of the development of a software product. According to Rubio and Gulo
(2015), rework is considered one of the activities in software development and is often
misunderstood or poorly defined.
Most developers spend the majority of their time on avoidable rework rather than
on work that is correct the first time. Eliminating avoidable rework remains a problem in
the software engineering field; as such, there is a great deal of ongoing research in this
area. According to Ramdoo and Huzooree (2015), rework can consume up to 70% of the
budget allocated for a software development project.
It is difficult to eliminate rework entirely, since some software defects are
inevitable. However, rework can be identified and avoided at an early stage of software
development by addressing project management issues, such as conflicting requirements
from clients, that introduce rework into a project. Rework is still
considered a complex and challenging problem in the software engineering field (Zahra
et al., 2014).
Morozoff (2010) and Conroy and Kruchten (2012) used metrics as an approach to
understand and reduce rework in software development. In this praxis, root causes of
software rework are identified and alternatives to reduce avoidable rework are discussed.
Some of the root causes of rework are applied as predictors for defect remediation time
prediction in section 2.4.
2.3.2 Root Causes Analysis of Software Rework
Ramdoo and Huzooree (2015) identified the root causes of software rework in a
Mauritius organization's software development using the Ishikawa cause-and-effect
methodology. They grouped the root causes of rework into the following categories:
• Ambiguous Project Requirements
• People and Testing
• History and Versioning
2.3.2.1 Ambiguous Project Requirements
According to Ramdoo and Huzooree (2015), ambiguous requirements remain a
problem in software development. The authors identified the following reasons for
requirements uncertainty among members of the production team of the Mauritius
organization:
• Requirements were not defined correctly.
• Conflicting requirements came from clients or teams.
• Requirements could not be gathered because some team members were on
personal leave or vacation.
• Team members lacked participation and involvement in the project.
• Requirement changes were not documented in a shared repository.
2.3.2.2 People and Testing (Test Cases)
Ramdoo and Huzooree (2015) observed that stakeholders in the Mauritius
organization had a difficult time expressing their project needs, since they preferred to
see something first to confirm, or even decide, what they wanted.
As such, developers and clients can have misunderstandings regarding how they view
requirements, leading to inaccurate expectations. Stakeholders play an important role in
software development; hence, issues caused by people in the Mauritius organization
occurred due to the following reasons:
• The team underestimated the significance of the requirements and design phases.
• There was a lack of technical insight from the team.
• Coding standards were improper.
• Overworking developers led to poor code quality.
Rework is a major problem in any organization. Ramdoo and Huzooree (2015)
discovered that developers in the Mauritius organization worked under pressure due to
schedule constraints. As a result, they were not fully involved in testing. There was
also no automated tool with which to perform regression testing; therefore, developers
provided minimal test cases and performed basic testing only. In addition, test plans were
not documented, and software defects were not properly fixed due to the tight deadline.
2.3.2.3 History and Versioning
Ramdoo and Huzooree (2015) mentioned that it was difficult to trace back all of
the code and document histories and versions because most backups were saved on
individual developers' workplace remote servers. The team had to run additional ad hoc jobs
to obtain the most updated and current versions of the code. Requirements were also
documented poorly or improperly; therefore, it took more time to search for and retrieve
the right version of a document.
2.3.3 Possible Ways of Reducing Avoidable Software Rework
Avoidable rework effort is defined as the effort of redoing work because the client
changed the requirements or the system was implemented incorrectly. Avoidable
rework can be minimized if best processes,
practices, and techniques are followed. Ramdoo and Huzooree (2015) evaluated the best
practices intended to reduce avoidable rework in order to determine the degree of
appropriateness of minimizing avoidable rework.
The best practices considered were:
• Standards and procedures: Following common programming standards and
procedures for how the system should be implemented. This avoids previous
mistakes and reduces rework.
• Audits and reviews: According to the IEEE standard for software reviews
and audits (2008), the function of the audit and review process is to examine
the system and its documentation to help validate system quality and ensure
that it meets client expectations. Auditors and reviewers can help find defects
in the system and thereby reduce future rework effort.
• Software Configuration Management (SCM): SCM is a process that can
trace, track, and control information concerning the software (Kim et al.,
2010). SCM can reduce rework by providing a trace of all histories and
versions of the changes made by developers in real time, without wasting time
searching for updated or historical work; by using tickets to view history and
update information; and by avoiding situations in which developers work on
an ad hoc basis.
2.4 Defect Remediation Time Prediction Metrics
Section 2.3 identified the root causes of software rework. From root causes
analysis, number of requirements and number of test cases from ambiguous project
requirements and testing categories, respectively, were selected for this praxis. Total
number of requirements and total number of test cases are applied in this praxis as
predictors for defect remediation time prediction because they are more likely to cause
rework effort.
Both software defect remediation time and software rework refer to the same
thing: the "effort [resources] required to fix software defects identified during system
testing" (Bhardwaj & Rana, 2015, p. 1). Consistent with IEEE standards, the term defect
remediation time is used throughout this praxis.
In this section, the predictors for software defect remediation time prediction are
discussed in detail based on previous studies. The purpose is to understand the predictors
of the time it takes to fix defects, because some rework is unavoidable. The defect
remediation time predictors are code size, number of
defects, total number of test cases, and total number of requirements (Goel & Singh,
2011; Ramdoo & Huzooree, 2015).
2.4.1 Code Size - Lines of Code (LOC)
Goel and Singh (2011) proposed various size-related metrics or class size (source
line of code; functional points) to predict defect remediation time. Goel and Singh (2011)
observed that the larger the class size, the more likely the software will introduce defects
which will require additional effort to fix.
In Goel and Singh's (2011) dataset, source lines of code was significant in
predicting defect fix effort based on correlation analysis. This may indicate that the
more lines of code one has on components, the higher the complexity in the code and the
higher the chance of introducing defects. This situation requires additional effort to fix
the defects.
In addition, Huda et al. (2017), Jing et al. (2018), Dhiauddin et al. (2012), Zhang
(2009), and Menzies et al. (2007) observed that there is a strong correlation between the
number of defects and source lines of code. This indicates that there is also a good chance
of a need for additional work to fix the defects. In this research, LOC is used as a
predictor for defect remediation time on closed projects, in order to conclude if LOC is
significant in predicting defect remediation time.
2.4.2 Number of Defects (NOD)
Goel and Singh (2011) indicated that number of defects is the best metric for
forecasting the defect remediation time. Therefore, the higher the number of defects in
the system, the more effort is required to fix them. This
indicates that there is a strong correlation between the number of defects and defect
remediation time based on Goel and Singh’s (2011) dataset.
2.4.3 Total Number of Test Cases (TTC)
According to Ramdoo and Huzooree (2015), having a minimal number of test cases
can lead to a higher chance of introducing defects, since only basic testing is conducted in
this situation. This can require additional time out of the allocated schedule to fix defects.
Dhiauddin et al. (2012) and Umar (2013) reported that the number of defects is directly
proportional to the number of test cases.
This relationship indicates that, as number of test cases increases, the chance of
defects occurring also increases. When there is a high probability of introducing defects,
there is also a high probability of requiring additional effort to fix them. Hence, TTC is a
good indicator for predicting defect remediation time.
2.4.4 Total Number of Requirements (TR)
Ambiguous project requirements can cause rework effort, according to Ramdoo
and Huzooree (2015). Too many requirements with too few developers can lead to a high
chance of introducing defects, which may require additional effort to fix. Kumar
& Malik (2019) reported that total number of requirements (functional and non-
functional) is one of the 18 attributes that can impact the quality of testing and introduce
defects. Every time a defect is introduced, there is an additional effort needed to fix that
defect. Hence, total number of requirements is used in this research as a predictor for
forecasting defect remediation time.
2.4.5 Summary of Metrics for Defect Remediation Time Prediction
Table 2-2 shows the source of predictors for defect remediation time prediction.
Table 2-2. Summary of Predictors for Defect Remediation Time Prediction

Metrics Type | Predictors | Authors & Year of Research
Code Metrics | Code Size (LOC) | Dhiauddin et al., 2012; Goel & Singh, 2011; Huda et al., 2017; Jing et al., 2018; Menzies et al., 2007; Zhang, 2009
Process Metrics | Number of Defects (NOD) | Goel & Singh, 2011
Process Metrics | Total Number of Test Cases (TTC) | Dhiauddin et al., 2012; Ramdoo & Huzooree, 2015; Umar, 2013
Process Metrics | Total Number of Requirements (TR) | Kumar & Malik, 2019; Ramdoo & Huzooree, 2015
2.5 Summary and Conclusion
Although many studies have identified potential ways to minimize the leakage of
defects and rework, it remains difficult to eliminate defects entirely; as the extant
literature notes, some rework is inevitable. Previous researchers have also addressed
alternatives for reducing rework. While many studies have concentrated on predicting
software defects, very few have looked at strategies to reduce rework and forecast defect
remediation time.
Most studies have concentrated on using open source datasets rather than
company datasets, due to the easy availability of open source data. Past research studies
have used these predictors individually in their defect and defect remediation time
predictions. However, these studies have not considered the combined influence of all
the predictors in their predictions.
Chapter 3—Methodology
3.1 Introduction
In this chapter, various methodologies are used to predict software defects and defect remediation time. The methodologies applied are categorized into regression and classification techniques. The tool used to build the models is the R language and its packages. Felix and Lee (2017) applied regression techniques such as simple and multiple linear regression models to predict the number of software defects. Perreault (2017) and Prasad et al. (2015) employed classification techniques such as random forest and support vector machine models to predict the number of defects.
Regression is a form of predictive modeling that examines the relationship between response (dependent) and predictor (independent) variables. The purpose of the technique is forecasting and determining the causal relationship between the response and the predictor variables. The commonly used regression techniques are:
• Multiple Linear Regression (MLR): Used to explain the relationship between a dependent variable and more than one independent variable.
• Negative Binomial Regression (NBR): Applied when the variance is greater than the mean for over-dispersed count data.
Classification is a methodology in which data are categorized into a number of classes for the purpose of predicting the class of new data. There are many commonly used classification techniques (Prasad et al., 2015). In this praxis, the following models are explored in detail.
• Random Forest: A classification and regression algorithm made up of many decision trees.
• Support Vector Machine: A classification and regression algorithm that concentrates on finding the best hyperplane that divides datasets into two classes.
3.1.1 Data
The source of the data in this praxis is IT company XYZ's1 historical defects from March 2016 to November 2019, found in the user acceptance testing (UAT) environment of XYZ databases. The dataset consists of 16 software releases, with four releases per year (March, May, August, and November). The dataset has 202 COBOL projects.
3.1.2 Data Description
The data include the following metrics; their abbreviations and definitions are
shown in Table 3-1.
Table 3-1. Metrics Definitions and Abbreviations

Metric                                | Abbreviation | Definition
Total Defect Remediation Time         | DRT  | Total time to fix the defects, expressed in hours.
Total Number of Components Delivered  | NOCD | Total number of modules (programs) that are completed and delivered to a tester before the beginning of testing.
Code Size                             | LOC  | Total source lines of code that are modified or added by a developer per the project scope.
Total Number of Developers            | TNOD | Number of programmers assigned to work on component(s)/project.
Total Number of Requirements          | TR   | Total number of project tasks that need to be completed per the business ask.
Total Number of Test Cases            | TTC  | Total number of input variables or conditions used to verify that a requirement is working as expected.
Number of Defects                     | NOD  | Errors in the source code that make the software product function in unintended ways, yielding unexpected results.
Project                               | PR   | "A temporary endeavor undertaken to create a unique project service or result" (PMI, 2008, p. 434).
Release (Year.Month.Day)              | RL   | The process of developing and delivering the final product of the software application.

1 XYZ is an anonymous IT company.
3.1.3 Proposed Approaches
Many studies have employed classification and regression techniques to predict software defects and the time needed to fix them (Felix & Lee, 2017; Goel & Singh, 2011; Perreault, 2017). The proposed approaches are based on statistical learning models, as shown in Figures 3-1, 3-2, and 3-3.
Figure 3-1. High-level Overview of Building a Model.
Figure 3-2. Process Flow to Identify Significant Predictors for NOD and DRT.
In this praxis, Figure 3-2 explains the step-by-step approach used to determine significant predictors for software defects and defect remediation time. First, data is imported into R from the dataset archives. Second, specific metrics are selected for defects and defect remediation time. The purpose of the metrics selection is to ensure that only predictors with a strong relationship to the target variable are selected; irrelevant variables are excluded. A data cleaning process is conducted to identify incomplete rows and remove them from the dataset, reducing impacts on model performance.
Third, using the negative binomial regression (NBR) and multiple linear regression (MLR) model results, significant predictors for the number of defects and defect remediation time are identified. Any predictor with a probability value (p-value) of less than 0.05 (the significance level) is considered statistically significant. Fourth, partial dependence (PD) plots are used to show the marginal effect of a small number of predictors on the response variable of the statistical learning model (Friedman, 2001; Zhao & Hastie, 2019). PD plots show whether the relationship between the response and independent variables is linear or complex; for example, when PD plots are applied to a linear regression model, the plots show a linear relationship.
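As an illustration, the following minimal R sketch draws a partial dependence plot with the randomForest package, assuming the Dataset1 data frame defined in Appendix A; the choice of Lines.of.Code is only an example.

library(randomForest)
# Fit a random forest on the defects data (Dataset1, Appendix A)
rf <- randomForest(Defects ~ ., data = Dataset1, importance = TRUE)
# Marginal effect of one predictor on the predicted number of defects
partialPlot(rf, pred.data = Dataset1, x.var = Lines.of.Code)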
Figure 3-3. Process Flow for SDP and DRT Prediction.
Figure 3-3 depicts the step-by-step process to predict software defects and defect remediation time. The first step is to import the data into R; the second step is to perform feature selection and data cleansing. The third step is splitting the dataset into training, validation, and testing categories. In the fourth step, 30% of the original dataset is reserved for testing and the remaining 70% forms a second partition for training and validation; 70% of this second partition is used to train the models and 30% is used to validate them. The fifth step is building the random forest and support vector machine models using the training dataset.
The sixth step is displaying the variable importance plots using random forest. Variable importance indicates that, when important variables are removed from the model, the error increases more than when less important variables are removed. The seventh step is determining the best fit model on the validation dataset using measures of error. The error measures applied in this praxis are root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), mean square error (MSE), and R-square. Lastly, the number of defects and defect remediation time are predicted using the testing dataset (unseen data).
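For reference, these error measures can be computed directly in R. The sketch below uses hypothetical actual and predicted vectors purely for illustration.

actual    <- c(2, 5, 8, 3)   # hypothetical observed values
predicted <- c(3, 4, 7, 3)   # hypothetical model predictions
mae  <- mean(abs(actual - predicted))                    # mean absolute error
mse  <- mean((actual - predicted)^2)                     # mean square error
rmse <- sqrt(mse)                                        # root mean square error
mape <- mean(abs((actual - predicted) / actual)) * 100   # undefined if any actual is 0
r2   <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)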
3.2 Regression Techniques
3.2.1 Negative Binomial Regression Model
The NBR model predicts the value of a dependent count variable from a set of independent variables (Yu, 2012). NBR is similar to standard simple linear regression, except that NBR assumes the counts are generated from a negative binomial distribution rather than from a normal distribution, as presumed by simple linear regression.
In this praxis, NBR is used to analyze the relationship between number of defects
(target/dependent variable) and predictors (NOCD, TR, TTC, LOC, and TNOD). The
purpose of the NBR is to determine significant predictors for number of defects. The
value of predicted NOD is a nonnegative integer, while the predictors have numerical
value and are continuous.
Let Y be the dependent variable (number of defects; NOD), where Y takes a value k ∈ {0, 1, 2, 3, ...}, representing a module with k defects. Let X1, X2, ..., Xn be the independent variables (NOCD, LOC, TNOD, TR, and TTC). The NBR analysis generates the probability below (Yu, 2012).

Pr(Y = k | X1 = x1, X2 = x2, ..., Xn = xn)    (1)

This is the probability that Y = k given the observed predictor values. According to Yu (2012), NBR analysis generates Equation 2, which is used to forecast the probability of having a number of defects in a module.
Pr(Y = k) = [Γ(k + r) / (Γ(r) k!)] × (r / (r + λ))^r × (λ / (r + λ))^k    (2)
Yu (2012) also defined the parameters as follows: "Where Γ is gamma function, λ is variance of Y and r is the dispersion parameter" (p. 64).
λ = exp(a + b1x1 + b2x2 + b3x3 + b4x4 + ... + bnxn)    (3)
NBR models Y under the assumption that the counts come from a negative binomial distribution with variance λ. The value of r and the parameters (a, b1, b2, b3, ..., bn) are estimated using the maximum likelihood method.
Gamma function: Γ(n) = (n − 1)! for positive integer n    (4)
Substituting Equations 3 and 4 into Equation 2 makes it straightforward to predict the probability of introducing defects in a module. To determine the most significant predictors for number of defects using the NBR model summary, all predictors with p-value < significance level are considered statistically significant. The significance level applied in this research is 0.05.
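A minimal R sketch of this step follows, reusing the glm.nb call from Appendix C; extracting the significant predictors from the model summary is an illustrative addition.

library(MASS)
# Fit NBR on the defects data (Dataset1, Appendix A)
nbr <- glm.nb(Defects ~ Components.Delivered + Developer.s. + Requirements +
              Test.Cases + Lines.of.Code, data = Dataset1)
# Keep predictors whose p-value falls below the 0.05 significance level
coefs <- summary(nbr)$coefficients
rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]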
3.2.2 Multiple Linear Regression Model
The MLR model finds the linear relationship between a response variable and two or more predictor variables (Prasad et al., 2015). It is used to understand the relationship between the output (dependent variable) and n independent variables (inputs). The independent variables for predicting defect remediation time are lines of code, number of defects, total number of requirements, and total number of test cases. MLR is used to predict and identify significant predictors of a response variable. Equation 5 represents the MLR model:

Y = B0 + B1X1 + B2X2 + ... + BnXn + E    (5)

where Y is the dependent variable and X1, ..., Xn are independent variables. B1, B2, ..., Bn are regression coefficients, B0 is the y-intercept (constant term), and E is the error. MLR depends on historical data to predict the values of the response variable. In this research, MLR is used to determine the most significant predictors for defect remediation time. If the p-value is less than the significance level of 0.05, then the predictors are considered statistically significant. For example, assume X1 is a single defect remediation time factor and Y is the predicted value of the response variable. Using simple linear regression, one can predict DRT as

Y = B0 + B1X1    (6)
where B1 quantifies the relationship between X1 and Y. Similarly, for predictors ranging from X1 to Xn, the regression coefficients range from B1 to Bn (Jadhav, 2019). In order to use MLR, the following assumptions should be met (Osborne & Waters, 2002; Williams et al., 2013):
1. There should be a linear relationship between the dependent and independent variables. Non-linearity can be fixed by transforming variables to achieve a linear state.
2. The variables need to be normally distributed.
3. MLR requires no autocorrelation in the dataset.
4. Homoscedasticity: the residual variance is the same across all levels of the regression line.
5. A larger sample size tends to yield better results compared to a small sample size.
6. MLR assumes little or no multicollinearity in the dataset.
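A minimal R sketch of fitting the MLR model for defect remediation time is shown below, assuming the Dataset2 data frame from Appendix A.

# Fit MLR for defect remediation time on Dataset2 (Appendix A)
mlr <- lm(Defect.Remediation.Time.In.hours ~ Defects + Lines.of.Code +
          Requirements + Test.Cases, data = Dataset2)
summary(mlr)   # coefficients, p-values (Pr(>|t|)), and R-squared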
3.3 Classification Techniques
3.3.1 Random Forest
Random Forest (RF) is a method that builds multiple decision trees based on random selection of independent variables and data (Pushphavathi et al., 2014). Each subset of data used to develop a tree can have a different size, as shown in Figure 3-4; the subsets may or may not overlap.
Figure 3-4. Data Subset: Random Selection of Data.
Figure 3-5. Independent Variables Set: Random Selection of Variables.
Assume X1 to Xn are independent variables (as shown in Figure 3-5) that can be used to develop decision trees. For the first tree, X1, X2, X3, and some other variables may be randomly selected; for the second tree, X4, X5, and some other variables may be randomly selected. The randomly selected variables are then used to build decision trees, called random decision trees, and the combination of individual random decision trees makes up the random forest. The four major benefits of having many trees are addressed in the remainder of this section:
1. Most of the decision trees are usually correct; only a portion of the trees err on any given observation, so the ensemble prediction is usually correct.
2. If you conduct a poll, as shown in Figure 3-6, the observation from the first, second, and fourth trees is Y while the third tree's observation is N. According to the majority voting process (Twala, 2011), Y will be the final observation. For classification, this means the final decision is based on a majority vote of all decision trees; for regression, it is the average of all decision trees' predictions.
3. RF can estimate missing values in the dataset.
4. RF is resistant to overfitting due to the averaging of many decision trees.
Figure 3-6. RF Classification Process.
The following is a summary of how RF works; a minimal R sketch follows the list:
• RF randomly selects subsets of data from the training dataset.
• RF randomly selects a number of independent variables.
• The RF model is built by developing multiple decision trees to form a forest (the trees are not pruned).
• A vote determines the most accurate prediction outcome based on the observations from the trees.
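The sketch below illustrates this workflow with the randomForest package; the training and validation data frames are assumed to come from the data partition step, and the parameter values are only examples.

library(randomForest)
set.seed(123)   # reproducible bootstrap samples
# Each tree sees a random subset of rows and, at each split, a random
# subset of predictors (mtry); the trees are not pruned
rf <- randomForest(Defects ~ ., data = training, ntree = 500, mtry = 3,
                   importance = TRUE)
# For regression the trees' outputs are averaged; for classification they vote
pred <- predict(rf, newdata = validation)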
3.3.1.2 Variable Importance and Feature Selection Objectives
RF can identify important variables by ranking features based on their level of importance. As shown in Figure 3-3, variable importance indicates that, when more important variables are removed from the model, the error increases more than when less important variables are removed. Feature importance is measured by adding noise to (permuting) each independent variable and observing the change in error. The calculation of the variable importance of each independent variable in the RF algorithm is as follows:
1. Utilize out-of-bag (OOB) data to compute the out-of-bag error (errorOOB1) for every decision tree in the RF. Out-of-bag data is the data not used to train the decision tree, usually estimated to be one-third of the original data. The OOB data determines the decision tree's performance, and its prediction error rate is the OOB error (Gao et al., 2019).
2. For the OOB data, randomly add noise to the independent variable (feature) X, recompute the OOB error, and mark the result as errorOOB2 (Gao et al., 2019).
3. The variable importance score IX of variable X for N trees in the RF is calculated as follows (Gao et al., 2019):

IX = Σ (errorOOB2 − errorOOB1) / N    (7)
The purposes of selecting features (independent variables) randomly are to:
• Produce a more accurate prediction model
• Develop a faster model
• Determine independent variables that are highly correlated with the dependent variable
A sketch of extracting these importance scores follows.
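Continuing the randomForest sketch above, the permutation importance of Equation 7 can be read directly from a fitted model (provided importance = TRUE was set when the forest was built):

importance(rf, type = 1)   # mean increase in error after permuting each variable
varImpPlot(rf, type = 1)   # variables ranked by permutation importance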
3.3.2 Support Vector Machine Model
The support vector machine (SVM) algorithm is used as both a classification and regression technique (Shuai et al., 2013). SVM centers on finding the hyperplane that separates objects belonging to different classes (Bowes et al., 2017; Prasad et al., 2015). A hyperplane is a decision boundary that separates a space into two classes.
If more than one line separates the classes (refer to Figure 3-7), the decision is to find the hyperplane with the maximum margin from the nearest data points on both sides. The blue and yellow data points/coordinates closest to the hyperplane are known as support vectors (Twala, 2011). In two dimensions (R2) the hyperplane is a line, while in three dimensions (R3) it is a plane.
Figure 3-7. 2-Dimensional Hyperplane and 3-Dimensional Hyperplane.
In Rn, a hyperplane is an (n − 1)-dimensional space, where n is the number of dimensions. The margin-maximizing hyperplane (Y) equation in n dimensions is:

Y = V0 + V1X1 + V2X2 + V3X3 + ... + VnXn
Y = V0 + V^T X, where V^T X = Σ ViXi
Y = b + V^T X    (8)

where V1, V2, V3, ..., Vn are the vector components, X is the vector of variables, and V0 = b is the bias term.
Although SVM models are mainly intended for linear classification, they can be used for non-linear classification through the kernel trick. A kernel transforms a non-linear space into a linear one by mapping low-dimensional data to a higher dimension. There are many types of kernel functions, such as linear, polynomial, Gaussian, and radial basis; each kernel type is suitable for a specific domain.
For predictions, SVM models separate the target variable data from the predictor data using the optimal hyperplane, after transforming the target and predictor data into a higher-dimensional feature space. The kernel used in this research is radial. The radial kernel is applied when data cannot be separated in linear form and therefore require non-linear decision boundaries.
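A minimal R sketch with the e1071 package follows; the praxis itself builds its SVM through caret (Appendix C), so this standalone call is only an equivalent illustration, again assuming the training and validation data frames.

library(e1071)
# Fit an SVM with a radial kernel (non-linear decision boundary)
svm.model <- svm(Defects ~ ., data = training, kernel = "radial")
svm.pred  <- predict(svm.model, newdata = validation)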
With non-linear kernels it is possible to overfit the data when many features are present. Overfitting happens when the model adapts too closely to the training data: it performs well on that data but fails to make accurate predictions on unseen data. The opposite of overfitting is underfitting, in which the model fails to adapt to the training data and performs poorly on both training and unseen data, resulting in poor predictions.
SVM models are accurate and widely used due to the following benefits:
1. SVM can separate linear and non-linear spaces using the kernel trick very quickly. It is less likely to overfit due to the hyperplane and the margin lines on both sides.
2. Its complexity with the number of variables is linear: for example, if you have 40 variables and double them to 80, the complexity (execution time) also doubles.
3. SVM works well on small datasets. It is less suitable for very large datasets with non-linear separation, because its complexity grows roughly with the cube of the number of records (R³) rather than linearly: doubling the number of records multiplies the training time by roughly eight, not two.
Chapter 4—Results
This chapter reports results of the models based on the four hypotheses. Each
section in this chapter addresses a research question, with the purpose of providing an
approach to minimize software defects and allocate resource efforts.
4.1 Analysis of Significant Predictors for Software Defects
In this section, findings regarding research question RQ1, hypothesis H1, and the associated preliminary model result are presented. The purpose of hypothesis H1 is to determine the most significant predictors for number of defects using negative binomial regression and random forest models. Negative binomial regression was chosen because the variance of the defect counts is much higher than the mean, indicating over-dispersion. For negative binomial regression, any predictor that achieves a p-value of less than or equal to 0.05 is considered statistically significant in predicting software defects.
Utilizing random forest partial dependence plots, it is also possible to determine the relationship between each individual predictor and the target variable. The original dataset is used to determine important predictors for software defects (see Appendix A). Research
question RQ1 and hypothesis H1 are listed as follows:
RQ1: Code size, total number of components delivered, number of developers
working on code components, total number of requirements, and total number of test
cases are the predictors influencing the number of defects. Which predictors are
significant in predicting the number of defects?
H1: Negative binomial regression and random forest can identify the most
important predictors for number of defects.
4.1.1 Defects Data Collection and Cleaning
Using the R tool, 202 rows of defect data with nine variables were collected and imported into RStudio. These variables were: project, release (month, day, and year), number of defects, defect remediation time, number of components delivered, number of developers, lines of code, requirements, and test cases. Project and release are categorical variables and were therefore excluded from the development of the models and their predictions. After data cleaning, the dataset had six variables and 202 rows, since three columns were removed: defect remediation time, project, and release.
4.1.2 Negative Binomial Regression Summary & Partial Dependency Plots
for Defects Significant Predictors
Using the defect dataset, named Dataset1 (see Appendix C), the negative binomial regression model was run to determine the most significant predictors for software defects. The glm.nb function from the R package MASS (Modern Applied Statistics with S) was used to build the negative binomial regression model. To choose the model, the variance and mean of defects were calculated first to determine whether negative binomial or Poisson regression was the more appropriate model to use.
According to the results, the variance of the number of defects is 35 and the mean is 6. Since the variance is much higher than the mean, the count outcome is over-dispersed. As such, negative binomial regression was used in this praxis instead of Poisson regression to determine significant predictors for number of defects. The summary result of the NBR model, using the whole dataset, is shown in Figure 4-1.
Figure 4-1. NBR Model Summary Result.
First, the R tool was used to perform the call and display the deviance residuals. Next, the regression coefficients for all the independent variables, the standard error, z-value, and p-value are displayed, as shown in Figure 4-1. The variable Developer.s refers to TNOD; its p-value is below the significance level (0.05), which indicates that TNOD is statistically significant. The variable Requirements refers to TR and has a p-value of 0.00122, which is also less than the significance level (0.05); TR is statistically significant. The variable Test.Cases refers to total number of test cases (TTC) and has a p-value of 0.00492; TTC is statistically significant since its p-value is less than 0.05.
According to the measure of errors metrics for NBR, the mean absolute error (MAE) is 4.191, root mean square deviation (RMSE) is 6.783, mean squared error (MSE) is 46.013, and mean absolute percentage error (MAPE) is 60.22. The measures of errors for random forest are as follows: MAE of 0.367, RMSE of 0.703, MSE of 0.4937, and MAPE of 11.197. Based on these error metrics, random forest is considered the best model for identifying the significant predictors for number of defects. The significant predictors for number of defects are lines of code (LOC), number of components delivered (NOCD), and total number of test cases (TTC).
Figures 4-2, 4-3, and 4-4 present the random forest partial dependency plots for the significant variables: LOC, NOCD, and TTC. Figure 4-2 shows that having more than twenty thousand lines of code changes on a project is likely to introduce more than eight defects. The more lines of code in a project, the higher the chance of defect leakage due to greater source code complexity.
Figure 4-2. Relationship between Number of Defects and LOC.
Figure 4-3 shows that delivering more than 100 components to the User Acceptance Testing (UAT) region indicates a higher chance of introducing more than eight defects on a tight-schedule project with fewer resources (testers) available to work on it.
Figure 4-3. Relationship between Number of Defects and NOCD.
Figure 4-4 indicates that executing more than 350 test cases that are critical to the
project requirements leads to more than six software defects.
Figure 4-4. Relationship between Number of Defects and TTC.
Therefore, the conclusion regarding Hypothesis H1 is that random forest can identify the most important predictors for number of defects. The significant predictors were lines of code (LOC), number of components delivered (NOCD), and total number of test cases (TTC).
4.2 Analysis of Software Defect Prediction Model
Random forest and support vector machine models were run using all the
predictors for software defects, as referenced in Appendix B. The objective of building
the models was to determine the best fit model that could accurately predict the number
of software defects in the upcoming release. Data partition and prediction error measures
are computed to compare the models. In this section, question RQ2 and hypothesis H2 are
addressed.
RQ2: How can statistical learning models forecast the number of defects using code
size, number of developers working on code components, total number of components
delivered, total number of test cases and total number of requirements?
H2: Random Forest and Support Vector Machine models can be used to predict the
number of defects.
4.2.1 Data Partition for Defects Prediction
After the completion of data cleanup (see Section 4.1.1), the next step was to partition the data. The original Dataset1 was split into two subsets: testing and a second partition containing the training and validation data. The testing data include 30% of the original Dataset1, and the remaining 70% was used for training and validation. Table 4-1 shows the data partition for predicting number of defects.
Table 4-1. Data Partition for Software Defects Prediction

Dataset Partition | Values
Training          | 104
Validation        | 40
Testing           | 58
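A minimal caret sketch of this two-stage split is shown below; the seed is arbitrary, and the resulting counts approximate those in Table 4-1.

library(caret)
set.seed(123)
# Stage 1: hold out 30% of Dataset1 as unseen testing data
idx2    <- createDataPartition(Dataset1$Defects, p = 0.7, list = FALSE)
second  <- Dataset1[idx2, ]     # 70%: training + validation
testing <- Dataset1[-idx2, ]    # 30%: testing
# Stage 2: split the second partition 70/30 into training and validation
idxTr      <- createDataPartition(second$Defects, p = 0.7, list = FALSE)
training   <- second[idxTr, ]
validation <- second[-idxTr, ]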
4.2.2 Variable Importance Plot Using Defects Data
According to the variable importance plot, lines of code (LOC) is more important than number of components delivered (NOCD): when LOC is removed from the model, the error increases more than when NOCD is removed (see Figure 4-5). Random forest determines significant predictors using the variable importance plot. For prediction, random forest uses all variables and randomly selected subsets of data to generate decision trees.
Figure 4-5. Variable Importance Plot for Software Defects.
4.2.3 Development of Software Defects Prediction
Mtry is the number of randomly selected predictors considered at each split of a decision tree. Using the training dataset, the random forest and support vector machine models were built. The mtry value is 3 for random forest (see Figure 4-6 and Appendix C), indicating that three randomly selected predictors are considered when splitting the decision trees. A radial kernel was used for the support vector machine to map data from low- to high-dimensional space (see Figure 4-7 and Appendix C).
Figure 4-6. Random Forest Model Result for Defects Prediction.
Figure 4-7. Support Vector Machine Model Result for Defects Prediction.
4.2.4 Measures of Model Accuracy for Software Defects Prediction
Using the validation dataset, the following measures of model accuracy were determined for software defect prediction using the random forest and support vector machine statistical learning models: R-square, mean absolute error (MAE), root mean square deviation (RMSE), mean squared error (MSE), and mean absolute percentage error (MAPE). Table 4-2 compares the error measures and identifies the best model for software defects prediction. Regardless of which error measure was used, random forest was the best fit model for predicting software defects, with an r-square of 85.9% based on unseen data.
Table 4-2. Measure of Errors for Software Defect Prediction

Dataset Type | Measure of Errors | Random Forest | Support Vector Machine
Training     | R-Squared         | 0.934         | 0.544
Validation   | MAE               | 0.739         | 1.86
Validation   | MAPE              | 20.75%        | 35.74%
Validation   | MSE               | 1.457         | 15.77
Validation   | RMSE              | 1.2           | 3.97
Testing      | R-Squared         | 0.859         | 0.688
4.2.5 Results of Software Defects Prediction Model
NBR analysis can predict the number of software defects in a component, but it is not effective in predicting fault-prone modules (Yu, 2012). According to the measure of errors metrics for NBR, the mean absolute error (MAE) is 4.48, root mean square deviation (RMSE) is 6.62, mean squared error (MSE) is 43, and mean absolute percentage error (MAPE) is 62.4 (see Appendix D). Based on these metrics and the Table 4-2 results, NBR is not as effective as random forest in predicting the number of software defects.
Using the best model, random forest, it is possible to predict software defects on unseen data, categorized as the testing dataset, to forecast the number of defects in an upcoming release. Figure 4-8 shows the actual (observed) and predicted defects. In addition, Figure 4-9 shows how far the predicted defects deviated from the actual defects (prediction error). For example, the first row shows that the actual and predicted defects are both 1, so the residual is 0. The residual is the difference between the predicted and the actual values.
Figure 4-8. Number of Predicted Defects vs. Actual Defects.
Figure 4-9. Actual Versus Predicted Defects Graph.
Therefore, the conclusion regarding Hypothesis H2 was that, based on the measures of error (MAE, MAPE, RMSE, MSE, and R-square), random forest was the best model for predicting the number of defects with an r-square of 85.9%, as compared to the support vector machine, which had an r-square of 68.8% based on unseen data (testing dataset).
4.3 Analysis of Significant Predictors for Software Defect Remediation Time
In this section, hypothesis H3 and the model result are presented. The purpose of hypothesis H3 is to determine the most significant predictors for defect remediation time using multiple linear regression (MLR). Using MLR, any predictor that achieves a p-value of less than or equal to 0.05 is considered statistically significant in predicting defect remediation time. The original dataset was used to determine important predictors for defect remediation time (see Appendix A). This section addresses the research question
RQ3 and hypothesis H3 and provides results of data collection, data cleaning, and model
result.
RQ3: Code size, number of defects, total number of requirements, and total
number of test cases are the predictors influencing the defect remediation time prediction.
Which predictors are significant in predicting defect remediation time?
H3: Multiple linear regression can identify the important predictors for defect
remediation time.
4.3.1 Data Collection and Cleaning for Defect Remediation Time
Using the R tool, 202 rows of defect remediation time data with nine variables were collected and imported into RStudio. After data cleansing, the dataset had five variables and 202 rows, since four columns were removed: project, release, number of developers, and components delivered. The predictors used for defect remediation time prediction are number of defects, lines of code, requirements, and test cases.
4.3.2 Multiple Linear Regression (MLR) Model Summary
Using the dataset named Dataset2, the multiple linear regression model was run to determine the most significant predictors for software defect remediation time (see Appendix C). The lm function from the R stats package was used to build the multiple linear regression model. The results of the MLR model are shown in Figure 4-10.
Figure 4-10. Multiple Linear Regression Model Result.
First, the R tool was used to perform the call and display the deviance residuals. Next, the regression coefficients for all of the independent variables, with standard error, t-value, and p-value, are displayed as shown in Figure 4-10. The variable Defects refers to NOD. NOD has a coefficient of 2.823 and a p-value of 1.82 × 10^-9. The p-value of NOD is less than the significance level (0.05), which indicates NOD is statistically significant. The R-square of the MLR model is 77%, which indicates how closely the data points fit the regression line.
4.3.3 Partial Dependency Plots for Significant Predictor(s) of Defect
Remediation Time
Figure 4-11 shows the random forest partial dependency plot for NOD. According to the PD plot, having more than seven defects in a highly complex system requires more than 25 hours to fix. The higher the number of defects, the more effort is required to fix them.
Figure 4-11. Relationship between Defect Remediation Time and NOD.
The conclusion regarding Hypothesis H3 was that multiple linear regression can identify the most important predictor for defect remediation time, with an r-square of 77%. The significant predictor is number of defects.
4.4 Analysis of Defect Remediation Time Prediction Model
Random forest and support vector machine models were run using all the
predictors for defect remediation time (see Appendix B). The objective of building the
models was to determine the best fit model that could accurately predict the defect
remediation time in the upcoming release. Data partition and prediction error measures
were computed to compare the models. In this section, question RQ4 and hypothesis H4
are addressed.
RQ4: How can statistical learning models forecast the defect remediation time
using code size, number of defects, total number of requirements, and total number of test
cases?
H4: Random forest and support vector machine models can be used to predict
defect remediation time.
4.4.1 Data Partition for Defect Remediation Time Prediction
After the completion of data cleanup (see Section 4.3.1), the next step was to partition the data. The original Dataset2 was split into two subsets: testing and a second partition containing the training and validation data. The testing data contain 30% of the original Dataset2, and the remaining 70% was used for training and validation. Table 4-3 shows the values of each dataset type.
Table 4-3. Data Partition for Defect Remediation Time Prediction

Dataset Partition | Values
Training          | 103
Validation        | 40
Testing           | 59
4.4.2 Variable Importance Plot Using Defect Remediation Time Data
Number of Defects (NOD) was found to be more important than lines of code
(LOC) such that, when NOD is removed from the model, the error increases more than
when LOC is removed (see Figure 4-12).
Figure 4-12. Variable Importance Plot for Defect Remediation Time.
4.4.3 Development of Software Defect Remediation Time Prediction
Using the training dataset for DRT prediction, only three predictors were randomly selected and used to split the decision trees (see Figure 4-13 and Appendix C). A radial kernel was used for the support vector machine.
Figure 4-13. Random Forest Model Result for DRT Prediction.
Figure 4-14. Support Vector Machine Model Result for DRT Prediction.
4.4.4 Model Accuracy Measures for Defect Remediation Time
Using the validation dataset, the following measures of model accuracy were determined for defect remediation time prediction based on the random forest and support vector machine statistical learning models (see Appendix D): R-square, mean absolute error (MAE), root mean square deviation (RMSE), mean squared error (MSE), and mean absolute percentage error (MAPE). Table 4-4 compares the error measures and identifies the best model for defect remediation time prediction. Random forest was the best fit model for predicting defect remediation time, with an r-square of 70.9%.
Table 4-4. Measure of Errors for Defect Remediation Time Prediction

Dataset Type | Measure of Errors | Random Forest | Support Vector Machine
Training     | R-Squared         | 0.83          | 0.49
Validation   | MAE               | 4.3           | 5.39
Validation   | MAPE              | 95.86%        | 119.85%
Validation   | MSE               | 29.367        | 59.28
Validation   | RMSE              | 5.42          | 7.69
Testing      | R-Squared         | 0.709         | 0.392
4.4.5 Results of Defect Remediation Time Prediction Model
Using the best model, random forest, it was possible to predict defect remediation time on unseen data, categorized as the testing dataset, to forecast the time it takes to fix defects in an upcoming release. Figure 4-15 shows the actual (observed) and predicted defect remediation time (see Appendix C). In addition, Figure 4-16 shows how far the predicted defect remediation time deviated from the actual defect remediation time (prediction error). For example, the first row shows the actual and predicted defect remediation times are 1 and 3 hours, respectively; the residual is 2 hours (3 minus 1 = 2).
Figure 4-15. Predicted Defect Remediation Time vs. Actual Defect Remediation
Time.
Figure 4-16. Actual vs. Predicted Defect Remediation Time Graph.
Therefore, the conclusion regarding Hypothesis H4 was that, based on the error metrics (MAE, MAPE, RMSE, MSE, and R-square), random forest was the best model for predicting defect remediation time with an r-square of 70.9%, as compared to the support vector machine, which had an r-square of 39.2% based on unseen data (testing dataset). Table 4-5 summarizes the hypothesis test results and the answers to the research questions.
Table 4-5. Summary Table

Hypothesis Number | Hypothesis Result | Research Question Result
H1 | RF can identify the most important predictors for number of defects | The significant predictors for number of defects are lines of code (LOC), number of components delivered (NOCD), and total test cases (TTC)
H2 | NBR is not as effective as RF in predicting the number of software defects | Using RF, it is possible to predict the number of defects using unseen data (testing dataset)
H3 | MLR can identify the most important predictors for defect remediation time | The significant predictor for defect remediation time is number of defects
H4 | SVM is not as effective as RF in predicting defect remediation time | Using RF, it is possible to predict defect remediation time using unseen data (testing dataset)
Chapter 5—Discussion and Conclusions
5.1 Discussion and Conclusions
The goal of the praxis was to predict software defects and defect remediation time in order to minimize the leakage of defects and allocate rework efforts in an upcoming software release. In order to perform the forecast, extant literature on predictors for
software defects and defect remediation time was reviewed. The literature review
provided insight on the type of predictors to be applied in this praxis, as well as
methodologies (negative binomial regression, random forest, support vector machine, and
multiple linear regression) used to test the hypotheses. Previous studies have been
conducted to investigate the circumstances, prior to testing, under which developers,
requirements, test cases, lines of code, and components delivered tend to introduce
defects. However, none considered using the combined influence of all the predictors to
predict software defects and defect remediation time.
By analyzing the XYZ company dataset from the past 16 software releases
containing 202 software projects, significant predictors influencing the number of defects
and defect remediation time were identified. The predictors were considered significant when the p-value of each variable was less than the 0.05 significance level (alpha). The significant predictors influencing the number of defects were total number of test cases, total number of developers, and total number of requirements. The significant predictor for defect remediation time was number of defects. Partial dependency plots were also applied to determine the marginal effect of the predictor(s) on the number of defects and defect remediation time. The partial dependency plots showed a strong correlation between the significant predictors and the response variable.
The following summary shows the marginal effect of each significant predictor on
the specific response variable:
• If there are more than seven developers working on a single module or project, the chances of having more than six defects are high, due to the lack of developer component focus and ownership. At the same time, having more than seven developers working on an agile project can enable developers to find more defects, which can minimize the leakage of defects in the release.
• Umar (2013) suggested that "If number of test cases are high and critical to requirements, the chances [of] getting defects is high" (p. 742). The results of this praxis indicate that executing more than 350 test cases that are critical to project requirements correlates with a high chance of introducing more than six defects.
• A project with more than 40 requirements on a tight schedule has a high chance of introducing more than six defects.
• If more than seven defects are found in complex systems, the total time to fix them will be more than 25 hours. Hence, the higher the number of defects, the more effort is required to fix them.
The results of this praxis suggest that it is feasible to predict defect remediation time after identifying the number of predicted defects. Therefore, it is recommended to forecast the number of defects and defect remediation time using the random forest model, which achieved r-squares of 85.9% and 70.9%, respectively.
5.2 Contributions to Body of Knowledge
1. Identifying the significant predictors (number of developers, total number of
requirements, total number of test cases, lines of code, and components
delivered) for software defects in order to analyze the effect of each
significant predictor on the number of defects.
2. Demonstrating a methodology for predicting software defects using the
combined influence of all the predictors (number of developers, total number
of requirements, total number of test cases, lines of code, and components
delivered) in order to minimize the leakage of defects in an upcoming
software release.
3. Identifying the significant predictors (number of defects, lines of code, total
number of test cases, and total number of requirements) of defect remediation
time in order to analyze the effect of each significant predictor on the defect
remediation time.
4. Demonstrating a methodology for predicting software defect remediation time
using the combined influence of all the predictors (number of defects, lines of
code, total number of test cases and total number of requirements) in order to
allocate rework efforts in an upcoming software release.
5.3 Recommendations for Future Research
Based on the model results and the predictors for software defects and defect remediation time, the following recommendations are made for future research:
• The data used in this praxis were based on COBOL projects, and the model results are specific to mainframe application systems. Extending the model to Java, C++, PHP, Perl, and Python datasets would enable it to work on various technology platforms.
• Currently, the model predicts software defects and defect remediation time prior to testing. Extending the model to predict defects and defect remediation time prior to production install would help minimize the leakage of defects.
• Extending the model to include an acceptable number of software defects in testing.
• The current dataset of 202 projects is relatively small. Extending the model to predict defects and defect remediation time on a large dataset of around 50,000 projects, like the Standish CHAOS database (The Standish Group, 2013), would improve measures of model performance and accuracy.
References
Akbarinasaji, S., Caglayan, B., & Bener, A. (2018). Predicting bug-fixing time: A
replication research using an open source software project. Journal of Systems
and Software, 136, 173–186. https://doi.org/10.1016/j.jss.2017.02.021
Bell, R., Ostrand, T., & Weyuker, E. (2013). The limited impact of individual developer
data on software defect prediction. Empirical Software Engineering, 18(3), 478–
505. https://doi.org/10.1007/s10664-011-9178-4
Bhardwaj, M., & Rana, A. (2015). Impact of size and productivity on testing and rework
efforts for web-based development projects. ACM SIGSOFT Software
Engineering Notes, 40(2), 1–4. https://doi.org/10.1145/2735399.2735404
Bird, C., Nagappan, N., Murphy, B., Gall, H., & Devanbu, P. (2011). Don’t touch my
code! Examining the effects of ownership on software quality. SIGSOFT/FSE
2011 - Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of
Software Engineering, 4–14. https://doi.org/10.1145/2025113.2025119
Bowes, D., Hall, T., & Petrić, J. (2017). Software defect prediction: do different
classifiers find the same defects? Software Quality Journal, 26(2), 525–552.
https://doi.org/10.1007/s11219-016-9353-3
Conroy, P., & Kruchten, P. (2012). Performance norms: An approach to rework reduction
in software development. Electrical & Computer Engineering (CCECE).
Dhiauddin, M., Suffian, M., & Ibrahim, S. (2012). A prediction model for system testing
defects using regression analysis. JSCSE, 2(7), 55–68.
https://doi.org/10.7321/jscse.v2.n7.6
Di Nucci, D., Palomba, F., De Rosa, G., Bavota, G., Oliveto, R., & De Lucia, A. (2018).
A developer centered bug prediction model. IEEE Transactions on Software
Engineering, 44(1), 5–24. https://doi.org/10.1109/TSE.2017.2659747
Eyolfson, J., Tan, L., & Lam, P. (2011). Do time of day and developer experience affect
commit bugginess. Proceedings - International Conference on Software
Engineering, 153–162. https://doi.org/10.1145/1985441.1985464
Fan, G., Diao, X., Yu, H., Yang, K., & Chen, L. (2019). Software defect prediction via
attention-based recurrent neural network. Scientific Programming, 2019, 1–14.
https://doi.org/10.1155/2019/6230953
Felix, E., & Lee, S. (2017). Integrated approach to software defect prediction. IEEE
Access, 5, 21524–21547. https://doi.org/10.1109/ACCESS.2017.2759180
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232. https://doi.org/10.1214/aos/1013203451
Gao, X., Wen, J., & Zhang, C. (2019). An improved random forest algorithm for
predicting employee turnover. Mathematical Problems in Engineering, 2019, 1–
12. https://doi.org/10.1155/2019/4140707
Geshwaree, H., & Ramdoo, V. (2015). A systematic research on requirement engineering
processes and practices in Mauritius. International Journal of Advanced Research
in Computer Science and Software Engineering, 5(2), 40–46.
Goel, B., & Singh, Y. (2011). An empirical analysis of metrics to predict the software
defect fix-effort. International Journal of Computers and Applications, 33(2).
https://doi.org/10.2316/journal.202.2011.2.202-2749
Harekal, D., and Suma, V. (2015). Implication of post production defects in software
industries. International Journal of Computer Applications, 109(17), 20–23.
https://doi.org/10.5120/19419-1032
Huda, S., Alyahya, S., Mohsin Ali, M., Ahmad, S., Abawajy, J., Al-Dossari, H., &
Yearwood, J. (2017). A framework for software defect prediction and metric
selection. IEEE Access, 6(99), 2844–2858.
https://doi.org/10.1109/ACCESS.2017.2785445
IEEE Software 2008 Editorial Calendar. (2008). IT Professional, 10(2), 18–18.
https://doi.org/10.1109/mitp.2008.30
Jadhav, R. B. (2019). A software defect learning and analysis utilizing regression method
for quality software development. International Journal of Advanced Trends in
Computer Science and Engineering, 1275–1282.
https://doi.org/10.30534/ijatcse/2019/38842019
Kim, D.-Y., & Youn, C. (2010). Traceability enhancement technique through the
integration of software configuration management and individual working
environment. Secure Software Integration and Reliability Improvement (SSIRI),
Fourth International Conference on IEEE.
Kumar, S., & Malik, K. (2019). Software metrics quality testing (SMQT) prediction
using logit regression model. International Journal of Computer Applications,
178(30), 1–4. https://doi.org/10.5120/ijca2019919114
Li, Z., Jing, X., & Zhu, X. (2018). Progress on approaches to software defect prediction.
Iet Software, 12(3), 161–175. https://doi.org/10.1049/iet-sen.2017.0148
Menzies, T., Greenwald, J., & Frank, A. (2007). Data mining static code attributes to
learn defect predictors. IEEE Transactions on Software Engineering, 33(1), 2–13.
https://doi.org/10.1109/TSE.2007.256941
Morozoff, E. (2010). Using a line of code metric to understand software rework. IEEE
Software, 27(1), 72–77. https://doi.org/10.1109/ms.2009.160
Osborne, J. W., & Waters, E. (2002). Four assumptions of multiple regression that
researchers should always test. Practical Assessment, Research & Evaluation,
8(2). https://doi.org/10.7275/r222-hv23
Ostrand, T., Weyuker, E., & Bell, R. (2010). Programmer-based fault prediction. ACM
International Conference Proceeding Series, 1–10.
https://doi.org/10.1145/1868328.1868357
Perreault, L. (2017). Using classifiers for software defect detection [Conference paper].
26th International Conference on Software Engineering and Data Engineering,
SEDE.
Posnett, D., D’Souza, R., Devanbu, P., & Filkov, V. (2013). Dual ecological measures of
focus in software development. Proceedings - International Conference on
Software Engineering, 452–461. https://doi.org/10.1109/ICSE.2013.6606591
Prasad, M. C. M., Florence, L. F., & Arya, A. (2015). A research on software metrics
based software defect prediction using data mining and statistical learning
techniques. International Journal of Database Theory and Application, 8(3), 179–
190. https://doi.org/10.14257/ijdta.2015.8.3.15
Project Management Institute. (2008). A guide to the project management body of
knowledge (PMBOK® guide) (4th ed.). Author.
Pushphavathi, T. P., Suma, V., & Ramaswamy, V. (2014). A novel method for software
defect prediction: Hybrid of FCM and random forest. 2014 International
Conference on Electronics and Communication Systems (ICECS).
https://doi.org/10.1109/ecs.2014.6892743
Rahman, F., & Devanbu, P. (2011). Ownership, experience and defects: A fine-grained
research of authorship. Proceedings - International Conference on Software
Engineering, 491–500. https://doi.org/10.1145/1985793.1985860
Ramdoo, V. D., & Huzooree, G. (2015). Strategies to reduce rework in software
development on an organisation in Mauritius. International Journal of Software
Engineering & Applications, 6(5), 09–20. https://doi.org/10.5121/ijsea.2015.6502
Rubio, R. P. M. T., & Gulo, C. A. (2015). Characterizing developers’ rework on GitHub
open source projects. Doctoral Symposium in Informatics Engineering.
Shuai, B., Li, H., Li, M., Zhang, Q., & Tang, C. (2013). Software defect prediction using
dynamic support vector machine. 2013 Ninth International Conference on
Computational Intelligence and Security. https://doi.org/10.1109/cis.2013.61
The Standish Group. (2013). The chaos manifesto. The Standish Group.
Twala, B. (2011). Predicting software faults in large space systems using statistical
learning techniques. Defence Science Journal, 61(4), 306–316.
https://doi.org/10.14429/dsj.61.1088
Umar, S. N. (2013). Software testing defect prediction model - a practical approach.
International Journal of Research in Engineering and Technology, 2(5), 741–745.
https://doi.org/10.15623/ijret.2013.0205001
Williams, M., Grajales, C., & Kurkiewicz, D. (2013). Assumptions of multiple
regression: Correcting two misconceptions. Practical Assessment, Research and
Evaluation, 18(9), 1–14.
Yu, L. (2012). Using negative binomial regression analysis to predict software faults: A
research of Apache Ant. International Journal of Information Technology and
Computer Science, 4(8), 63–70. https://doi.org/10.5815/ijitcs.2012.08.08
Zahra, S., Nazir, A., Khalid, A., Raana, A., & Nadeem Majeed, M. (2014). Performing
inquisitive research of PM traits desirable for project progress. International
Journal of Modern Education and Computer Science, 6(2), 41–47.
https://doi.org/10.5815/ijmecs.2014.02.6
Zhang, H. (2009). An investigation of the relationships between lines of code and defects.
IEEE International Conference on Software Maintenance, ICSM, 274–283.
https://doi.org/10.1109/ICSM.2009.5306304
Zhao, Q., & Hastie, T. (2019). Causal Interpretations of Black-Box Models. Journal of
Business & Economic Statistics, 1–10.
https://doi.org/10.1080/07350015.2019.1624293
Appendix A—Dataset for Defects and Defect Remediation Time
The proprietary defect and defect remediation time data came from firm XYZ and contain 202 COBOL projects across 16 releases from 2016 to 2019. The following R source code and global environment show the data collection and cleaning process for defect and defect remediation time prediction.
1. Source Code for Data Import and Cleanup:
# readXL is provided by the RcmdrMisc package
library(RcmdrMisc)
# Import Data to RStudio:
Dataset <- readXL("C:/Users/T-sus/Desktop/data.xlsx", rownames = FALSE,
                  header = TRUE, na = "", sheet = "CRDB", stringsAsFactors = TRUE)
# Perform Data Cleanup Process:
Dataset1 <- subset(Dataset, select = c(Components.Delivered, Defects, Developer.s.,
                                       Lines.of.Code, Requirements, Test.Cases))
Dataset2 <- subset(Dataset, select = c(Defect.Remediation.Time.In.hours, Defects,
                                       Lines.of.Code, Requirements, Test.Cases))
• Global environment: Defects Prediction
According to the global environment, the dataset has 202 projects and nine variables before cleanup. After data cleansing, the dataset for defects prediction had six variables and 202 projects, since three columns were removed: defect remediation time, project, and release.
For defect remediation time prediction, the dataset had five variables and 202 projects after the data cleanup process. Four variables were removed: project, release, number of developers, and components delivered.
Figure A-1. Defects.
Figure A-2. Defect Remediation Time.
Appendix B—Metrics for Defects and Defect Remediation time
Table B-1 defines each metric and its abbreviation as applied in this praxis.
Table B-1. Metrics Definitions and Abbreviations

Metric                                | Abbreviation | Definition
Total Defect Remediation Time         | DRT  | Total time to fix the defects, expressed in hours.
Total Number of Components Delivered  | NOCD | Total number of modules (programs) that are completed and delivered to a tester before the beginning of testing.
Code Size                             | LOC  | Total source lines of code that are modified or added by a developer per the project scope.
Total Number of Developers            | TNOD | Number of programmers assigned to work on component(s)/project.
Total Number of Requirements          | TR   | Total number of project tasks that need to be completed per the business ask.
Total Number of Test Cases            | TTC  | Total number of input variables or conditions used to verify that a requirement is working as expected.
Number of Defects                     | NOD  | Errors in the source code that make the software product function in unintended ways, yielding unexpected results.
Project                               | PR   | "A temporary endeavor undertaken to create a unique project service or result" (PMI, 2008, p. 434).
Release (Year.Month.Day)              | RL   | The process of developing and delivering the final product of the software application.
Tables B-2 and B-3 list the predictors applied in this research for predicting number of defects and defect remediation time.
Table B-2. Summary of Predictors for Software Defect Prediction

Metrics Type    | Predictors                                               | Authors & Year of Research
Code Metrics    | Code Size (LOC)                                          | Dhiauddin et al. (2012); Huda et al. (2017); Jing et al. (2018); Menzies et al. (2007); Zhang (2009)
Process Metrics | Total Number of Test Cases (TTC)                         | Dhiauddin et al. (2012); Umar (2013)
Process Metrics | Total Number of Requirements (TR)                        | Kumar & Malik (2019)
Process Metrics | Number of Components Delivered (NOCD)                    | Umar (2013)
Process Metrics | Number of Developers (TNOD) Working on Code Components   | Bell et al. (2013); Bird et al. (2011); Di Nucci et al. (2018); Eyolfson et al. (2011); Ostrand et al. (2010); Posnett et al. (2013); Rahman & Devanbu (2011)
Table B-3. Summary of Predictors for Defect Remediation Time Prediction

Metrics Type    | Predictors                         | Authors & Year of Research
Code Metrics    | Code Size (LOC)                    | Dhiauddin et al. (2012); Goel & Singh (2011); Huda et al. (2017); Jing et al. (2018); Menzies et al. (2007); Zhang (2009)
Process Metrics | Total Number of Test Cases (TTC)   | Dhiauddin et al. (2012); Ramdoo & Huzooree (2015); Umar (2013)
Process Metrics | Total Number of Requirements (TR)  | Kumar & Malik (2019); Ramdoo & Huzooree (2015)
Process Metrics | Number of Defects (NOD)            | Goel & Singh (2011)
Appendix C—Models Development & Results
This appendix addresses the four research questions and hypotheses by providing the source code and model results.
1. Analysis of Significant Predictors for Software Defects.
RQ1: Code size, total number of components delivered, number of developers
working on code components, total number of requirements and total number of
test cases are the predictors influencing the number of defects. Which predictors
are significant in predicting the number of defects?
H1: Negative binomial regression and random forest can identify the most
important predictors for number of defects.
• Source code to build the NBR model:
# Load MASS, which provides glm.nb for negative binomial regression
library(MASS)
# Build the NBR model on the overall dataset
mymodel <- glm.nb(Defects ~ Components.Delivered + Developer.s. +
                  Requirements + Test.Cases + Lines.of.Code, data = Dataset1)
summary(mymodel)
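The summary output reports coefficients on the log scale. Although not part of the praxis output, one optional interpretation step is to exponentiate them into incidence rate ratios; a minimal sketch:
# Multiplicative effect of each predictor on the expected defect count
exp(coef(mymodel))
# Corresponding 95% confidence intervals (profile likelihood)
exp(confint(mymodel))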
Results: According to Figure C-1, the significant predictors are the total number of
developers (TNOD), total number of requirements (TR), and total number of test
cases (TTC).
Figure C-1. NBR Model Summary Result.
• Build the random forest model using the whole dataset:
# caret provides the train wrapper used throughout this appendix
library(caret)
rf.fit <- train(Defects ~ Components.Delivered + Developer.s. + Requirements +
                Test.Cases + Lines.of.Code, data = Dataset1, method = "rf",
                importance = TRUE)
rf.fit
• Variable importance for significant predictors for number of defects:
varImp(rf.fit)        # ranked importance scores
plot(varImp(rf.fit)) # the variable importance plot referenced in the results
Results: Using the random forest variable importance plot, the significant predictors
for the number of defects are lines of code, total number of test cases, and number of
components delivered, with overall importance scores of 100, 78.91, and 73.42,
respectively. According to the measure-of-errors metrics for NBR, the mean absolute
error (MAE) is 4.191, the root mean square error (RMSE) is 6.783, the mean squared
error (MSE) is 46.013, and the mean absolute percentage error (MAPE) is 60.22%. The
measure-of-errors metrics for random forest are as follows: MAE equal to
0.367, RMSE equal to 0.703, MSE equal to 0.4937, and MAPE equal to 11.197%.
Based on these metrics, random forest is considered the best model
for identifying the significant predictors for number-of-defects prediction.
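For reference, the error metrics quoted throughout this appendix follow their standard definitions for n observations with actual values y_i and predictions ŷ_i:

\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \quad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \quad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \quad
\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left\lvert\frac{y_i - \hat{y}_i}{y_i}\right\rvert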
2. Analysis of Software Defect Prediction Model.
RQ2: How can statistical learning models forecast the number of defects using code
size, number of developers working on code components, total number of
components delivered, total number of test cases and total number of requirements?
H2: Random Forest and Support Vector Machine models can be used to predict the
number of defects.
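The blocks below, and those in Appendix D, reference training, validation, and testing partitions that are not constructed in this appendix. A minimal sketch of one way to create them with caret is given here; the 60/20/20 split proportions and the seed are assumptions, not values stated in the praxis.

library(caret)
set.seed(123)  # assumed seed, for reproducibility only

# Carve out 60% for training, then split the remainder evenly into
# validation and testing partitions.
inTrain  <- createDataPartition(Dataset1$Defects, p = 0.6, list = FALSE)
training <- Dataset1[inTrain, ]
holdout  <- Dataset1[-inTrain, ]
inVal      <- createDataPartition(holdout$Defects, p = 0.5, list = FALSE)
validation <- holdout[inVal, ]
testing    <- holdout[-inVal, ]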
• Source code to build the RF and SVM models:
# Build random forest model using the training dataset
rf.fit <- train(Defects ~ Components.Delivered + Developer.s. + Requirements +
                Test.Cases + Lines.of.Code, data = training, method = "rf",
                importance = TRUE)
rf.fit
# Build support vector machine model using the training dataset
svm.fit <- train(Defects ~ Components.Delivered + Developer.s. + Requirements +
                 Test.Cases + Lines.of.Code, data = training,
                 method = "svmRadial")
svm.fit
# Predict the number of defects on the testing dataset using RF
Actualpredicted <- predict(rf.fit, newdata = testing)
PredictedDefects <- round(Actualpredicted, 0)  # round to whole defects
ActualDefects <- testing$Defects
View(data.frame(ActualDefects, PredictedDefects))
plot(ActualDefects, PredictedDefects)
Results: Figure C-2 shows the actual (observed) defects against the predicted defects.
Random forest was considered the best model for predicting the number of defects,
with an r-square of 85.9%.
Figure C-2. Number of Predicted Defects vs. Actual Defects.
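The appendix does not show how the 85.9% r-square was computed; one plausible sketch uses caret's postResample, which returns RMSE, Rsquared, and MAE for a vector of predictions:

# Rsquared here is the squared correlation between predicted and
# observed defect counts on the testing partition.
postResample(pred = predict(rf.fit, newdata = testing),
             obs = testing$Defects)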
3. Analysis of Significant Predictors for Software Defect Remediation Time.
RQ3: Code size, number of defects, total number of requirements and total number
of test cases are the predictors influencing the defect remediation time prediction.
Which predictors are significant in predicting defect remediation time?
H3: Multiple linear regression can identify the important predictors for defect
remediation time.
• Source code to build the MLR model:
# MASS is loaded for consistency with the other models; lm itself is in base R
library(MASS)
# Build the multiple linear regression model and summarize significant predictors
LinearModel.1 <- lm(Defect.Remediation.Time.In.hours ~ Defects +
                    Requirements + Test.Cases + Lines.of.Code, data = Dataset2)
summary(LinearModel.1)
• Results: According to Figure C-3, the significant predictor for defect
remediation time is the number of defects (NOD).
Figure C-3. Multiple Linear Regression Model Result.
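Although not shown in the praxis, standard residual diagnostics are a useful check on the linear model assumptions behind these p-values; a minimal sketch:

# Four-panel diagnostics: residuals vs. fitted, normal Q-Q,
# scale-location, and residuals vs. leverage.
par(mfrow = c(2, 2))
plot(LinearModel.1)
par(mfrow = c(1, 1))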
4. Analysis of Defect Remediation Time Prediction Model.
RQ4: How can statistical learning models forecast the defect remediation time using
code size, number of defects, total number of requirements and total number of test
cases?
H4: Random Forest and Support Vector Machine models can be used to predict
defect remediation time.
• Source code to build the RF and SVM models:
# Build random forest model using the training dataset
rf.fit <- train(Defect.Remediation.Time.In.hours ~ Defects + Requirements +
                Test.Cases + Lines.of.Code, data = training, method = "rf",
                importance = TRUE)
rf.fit
# Build support vector machine model using the training dataset
svm.fit <- train(Defect.Remediation.Time.In.hours ~ Defects + Requirements +
                 Test.Cases + Lines.of.Code, data = training,
                 method = "svmRadial")
svm.fit
# Predict the defect remediation time on the testing dataset using RF
Actualpredicted <- predict(rf.fit, newdata = testing)
PredictedDefectsTime <- round(Actualpredicted, 0)  # round to whole hours
ActualDefectsTime <- testing$Defect.Remediation.Time.In.hours
View(data.frame(ActualDefectsTime, PredictedDefectsTime))
plot(ActualDefectsTime, PredictedDefectsTime)
Results: Figure C-4 shows the actual (observed) defect remediation time against the
predicted remediation time. Random forest was considered the best model for
predicting the defect remediation time, with an r-square of 70.9%.
Figure C-4. Predicted Defect Remediation Time vs. Actual Defect Remediation
Time.
Appendix D—Measures of Model Performance
This appendix provides the measure-of-errors results from the defects and defect
remediation time predictions. The source code and results are displayed as follows.
• Source code for RF, SVM, and NBR (defects):
# Using the training dataset
RF:
rf.fit <- train(Defects ~ Components.Delivered + Developer.s. + Requirements +
                Test.Cases + Lines.of.Code, data = training, method = "rf",
                importance = TRUE)
rf.fit
SVM:
svm.fit <- train(Defects ~ Components.Delivered + Developer.s. + Requirements +
                 Test.Cases + Lines.of.Code, data = training,
                 method = "svmRadial")
svm.fit
NBR:
mymodel <- glm.nb(Defects ~ Components.Delivered + Developer.s. + Requirements +
                  Test.Cases + Lines.of.Code, data = training)
summary(mymodel)
# Using the validation dataset
RF:
residuals.rf <- predict(rf.fit, newdata = validation) - validation$Defects
MSE.rf  <- mean(residuals.rf^2)
RMSE.rf <- sqrt(mean(residuals.rf^2))
MAE.rf  <- mean(abs(residuals.rf))
residuals3.rf <- validation$Defects - predict(rf.fit, newdata = validation)
MAPE.rf <- mean(abs((residuals3.rf / validation$Defects) * 100), na.rm = TRUE)
SVM:
residuals.svm <- predict(svm.fit, newdata = validation) - validation$Defects
MSE.svm  <- mean(residuals.svm^2)
RMSE.svm <- sqrt(mean(residuals.svm^2))
MAE.svm  <- mean(abs(residuals.svm))
residuals3.svm <- validation$Defects - predict(svm.fit, newdata = validation)
MAPE.svm <- mean(abs((residuals3.svm / validation$Defects) * 100), na.rm = TRUE)
NBR:
# Note: predict.glm defaults to the link (log) scale; type = "response" is
# needed so the NBR residuals are computed on the defect-count scale.
residuals.nb <- predict(mymodel, newdata = validation, type = "response") -
  validation$Defects
residuals3.nb <- validation$Defects -
  predict(mymodel, newdata = validation, type = "response")
MAPE.nb <- mean(abs((residuals3.nb / validation$Defects) * 100), na.rm = TRUE)
MSE.nb  <- mean(residuals.nb^2)
RMSE.nb <- sqrt(mean(residuals.nb^2))
MAE.nb  <- mean(abs(residuals.nb))
# Using the testing dataset
RF:
rf.fit <- train(Defects ~ Components.Delivered + Developer.s. + Requirements +
                Test.Cases + Lines.of.Code, data = testing, method = "rf",
                importance = TRUE)
rf.fit
SVM:
svm.fit <- train(Defects ~ Components.Delivered + Developer.s. + Requirements +
                 Test.Cases + Lines.of.Code, data = testing,
                 method = "svmRadial")
svm.fit
• Results:
According to the measure-of-errors metrics for NBR, the mean absolute error (MAE) is
4.48, the root mean square error (RMSE) is 6.62, the mean squared error (MSE) is 43,
and the mean absolute percentage error (MAPE) is 62.4%. Using these NBR metrics
and Table D-1, random forest was the best-fit model for predicting defects, with an
r-square of 85.9%.
Table D-1. Measure of Errors for Software Defects Prediction
Dataset Type | Measure of Errors | Random Forest | Support Vector Machine
Training   | R-Square | 0.934  | 0.544
Validation | MAE      | 0.739  | 1.86
Validation | MAPE     | 20.75% | 35.74%
Validation | MSE      | 1.457  | 15.77
Validation | RMSE     | 1.2    | 3.97
Testing    | R-Square | 0.859  | 0.688
• Source code for RF and SVM (defect remediation time):
# Using the training dataset
RF:
rf.fit <- train(Defect.Remediation.Time.In.hours ~ Defects + Requirements +
                Test.Cases + Lines.of.Code, data = training, method = "rf",
                importance = TRUE)
rf.fit
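The SVM training call is not repeated in this appendix; presumably it mirrors the Appendix C block for defect remediation time:
SVM:
svm.fit <- train(Defect.Remediation.Time.In.hours ~ Defects + Requirements +
                 Test.Cases + Lines.of.Code, data = training,
                 method = "svmRadial")
svm.fit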
# Using the validation dataset
RF:
residuals.rf <- predict(rf.fit, newdata = validation) -
  validation$Defect.Remediation.Time.In.hours
residuals4.rf <- ((validation$Defect.Remediation.Time.In.hours -
  predict(rf.fit, newdata = validation)) /
  validation$Defect.Remediation.Time.In.hours) * 100
MSE.rf  <- mean(residuals.rf^2)
RMSE.rf <- sqrt(mean(residuals.rf^2))
MAE.rf  <- mean(abs(residuals.rf))
MAPE.rf <- mean(abs(residuals4.rf))  # missing in the original; added to match Table D-2
SVM:
residuals.svm <- predict(svm.fit, newdata = validation) -
  validation$Defect.Remediation.Time.In.hours
MSE.svm  <- mean(residuals.svm^2)
RMSE.svm <- sqrt(mean(residuals.svm^2))
MAE.svm  <- mean(abs(residuals.svm))
residuals4.svm <- ((validation$Defect.Remediation.Time.In.hours -
  predict(svm.fit, newdata = validation)) /
  validation$Defect.Remediation.Time.In.hours) * 100
MAPE.svm <- mean(abs(residuals4.svm))
# Using the testing dataset
RF:
rf.fit <- train(Defect.Remediation.Time.In.hours ~ Defects + Requirements +
                Test.Cases + Lines.of.Code, data = testing, method = "rf",
                importance = TRUE)
rf.fit
SVM:
svm.fit <- train(Defect.Remediation.Time.In.hours ~ Defects + Requirements +
                 Test.Cases + Lines.of.Code, data = testing,
                 method = "svmRadial")
svm.fit
• Results:
Table D-2 compares the measures of errors and determines the best model for defect
remediation time prediction. Random forest was the best-fit model for predicting
defect remediation time, with an r-square of 70.9%.
Table D-2. Measure of Errors for Defect Remediation Time Prediction
Dataset Type | Measure of Errors | Random Forest | Support Vector Machine
Training   | R-Square | 0.83   | 0.49
Validation | MAE      | 4.3    | 5.39
Validation | MAPE     | 95.86% | 119.85%
Validation | MSE      | 29.367 | 59.28
Validation | RMSE     | 5.42   | 7.69
Testing    | R-Square | 0.709  | 0.392