
e-Informatica Software Engineering Journal, Volume 12, Issue 1, 2018, pages: 199–216, DOI 10.5277/e-Inf180108

Applying Machine Learning to Software Fault Prediction

Bartłomiej Wójcicki∗, Robert Dąbrowski∗
∗Institute of Informatics, University of Warsaw

[email protected], [email protected]

Abstract
Introduction: Software engineering continuously suffers from inadequate software testing. The automated prediction of possibly faulty fragments of source code allows developers to focus development efforts on fault-prone fragments first. Fault prediction has been a topic of many studies concentrating on C/C++ and Java programs, with little focus on such programming languages as Python.
Objectives: In this study the authors want to verify whether the type of approach used in former fault prediction studies can be applied to Python. More precisely, the primary objective is conducting preliminary research using simple methods that would support (or contradict) the expectation that predicting faults in Python programs is also feasible. The secondary objective is establishing grounds for more thorough future research and publications, provided promising results are obtained during the preliminary research.
Methods: It has been demonstrated [1] that using machine learning techniques, it is possible to predict faults for C/C++ and Java projects with recall 0.71 and false positive rate 0.25. A similar approach was applied in order to find out if promising results can be obtained for Python projects. The working hypothesis is that choosing Python as a programming language does not significantly alter those results. A preliminary study is conducted and a basic machine learning technique is applied to a few sample Python projects. If these efforts succeed, it will indicate that the selected approach is worth pursuing as it is possible to obtain for Python results similar to the ones obtained for C/C++ and Java. However, if these efforts fail, it will indicate that the selected approach was not appropriate for the selected group of Python projects.
Results: The research demonstrates experimental evidence that fault-prediction methods similar to those developed for C/C++ and Java programs can be successfully applied to Python programs, achieving recall up to 0.64 with false positive rate 0.23 (mean recall 0.53 with false positive rate 0.24). This indicates that more thorough research in this area is worth conducting.
Conclusion: Having obtained promising results using this simple approach, the authors conclude that the research on predicting faults in Python programs using machine learning techniques is worth conducting, natural ways to enhance the future research being: using more sophisticated machine learning techniques, using additional Python-specific features and extended data sets.

Keywords: classifier, fault prediction, machine learning, metric, Naïve Bayes, Python, quality, software intelligence

1. Introduction

Software engineering is concerned with the development and maintenance of software systems. Properly engineered systems are reliable and they satisfy user requirements, while at the same time their development and maintenance is affordable.

In the past half-century computer scientists and software engineers have come up with numerous ideas for how to improve the discipline of software engineering. Structural programming [2] restricted the imperative control flow to hierarchical structures instead of ad-hoc jumps. Computer programs written in this style were more readable, easier to understand and reason about. Another improvement was the introduction of an object-oriented paradigm [3] as a formal programming concept.

In the early days software engineers perceived significant similarities between software and civil engineering processes. The waterfall model [4], which resembles engineering practices, was widely adopted as such, regardless of its original description actually suggesting a more agile approach.

It soon turned out that building software differs from building skyscrapers and bridges, and the idea of extreme programming emerged [5], its key points being: keeping the code simple, reviewing it frequently, and early and frequent testing. Among numerous techniques, test-driven development was promoted, which eventually resulted in the increased quality of produced software and the stability of the development process [6]. Contemporary development teams started to lean towards short iterations (sprints) rather than fragile upfront designs, and short feedback loops, thus allowing customers' opinions to provide timely influence on software development. This meant creating even more complex software systems.

The growing complexity of software resulted in the need to describe it at different levels of abstraction and, in addition to this, the notion of software architecture developed. The emergence of patterns and frameworks had a similar influence on architecture as design patterns and idioms had on programming. Software started to be developed by assembling reusable software components which interact using well-defined interfaces, while component-oriented frameworks and models provided tools and languages making them suitable for formal architecture design.

However, a discrepancy between the architecture level of abstraction and the programming level of abstraction prevailed. While the programming phase remained focused on generating code within a preselected (typically object-oriented) programming language, the architecture phase took place in the disconnected component world. The discrepancies typically deepened as the software kept gaining features without being properly refactored, and development teams kept changing over time, working under time pressure with incomplete documentation and requirements that were subject to frequent changes. Multiple development technologies, programming languages and coding standards made this situation even more severe. The unification of modelling languages failed to become a silver bullet.

The discrepancy accelerated research on software architecture and the automation of software engineering. This includes the vision for the automated engineering of software based on architecture warehouse and software intelligence [7] ideas. The architecture warehouse denotes a repository of the whole software system and software process artefacts. Such a repository uniformly captures and regards as architectural all information which was previously stored separately in design documents, version-control systems or simply in the minds of software developers. Software intelligence denotes a set of tools for the automated analysis, optimization and visualization of the warehouse content [8, 9].

An example of this approach is combining information on source code artefacts, such as functions, with the information on software process artefacts, such as version control comments indicating the developers' intents behind changes in given functions. Such an integration of source code artefacts and software process artefacts makes it possible to aim for more sophisticated automated learning and reasoning in the area of software engineering, for example obtaining an ability to automatically predict where faults are likely to occur in the source code during the software process.

The automated prediction of possibly faulty fragments of the source code, which allows developers to focus development efforts on the bug-prone modules first, is the topic of this research. This is an appealing idea since, according to a U.S. National Institute of Standards and Technology's study [10], inadequate software testing infrastructure costs the U.S. economy an estimated $60 billion annually. One of the factors that could yield savings is identifying faults at earlier development stages.

For this reason, fault prediction was the subject of many previous studies. So far, software researchers have concluded that defect predictors based on machine learning methods are practical [11] and useful [12]. Such studies were usually focused on C/C++ and Java projects [13], omitting other programming languages, such as Python.

This study demonstrates experimentally that techniques used in former fault prediction studies can be successfully applied to software developed in Python. The paper is organized as follows: in section 2 the related works are recalled; in section 3 the theoretical foundations and implementation details of the approach being the subject of this study are highlighted; the main results are presented in section 4, with conclusions to follow in section 5. The implementation of the method used in this study for predictor evaluation is outlined in the Appendix; it can be used to reproduce the results of the experiments. The last section contains the bibliography.

2. Related work

Software engineering is a sub-field of applied computer science that covers the principles and practice of architecting, developing and maintaining software. Fault prediction is a software engineering problem. Artificial intelligence studies software systems that are capable of intelligent reasoning. Machine learning is a part of artificial intelligence dedicated to one of its central problems: automated learning. In this research machine learning methods are applied to a fault prediction problem.

For a given Python software project, the architectural information warehoused in the project repository is used to build tools capable of automated reasoning about possible faults in a given source code. More specifically: (1) a tool able to predict which parts of the source code are fault-prone is developed; and (2) its operation is demonstrated on five open-source projects.

Prior works in this field [1] demonstrated that it is possible to predict faults for C/C++ and Java projects with a recall rate of 71% and a false positive rate of 25%. The tool presented in this paper demonstrates that it is possible to predict faults in Python, achieving recall rates up to 64% with a false positive rate of 23% for some projects; for all tested projects the achieved mean recall was 53% with a false positive rate of 24%.

Fault prediction spans multiple aspects of software engineering. On the one hand, it is a software verification problem. In 1989 Boehm [14] defined the goal of verification as an answer to the question "Are we building the product right?" Contrary to formal verification methods (e.g. model checking), fault predictors cannot be used to prove that a program is correct; they can, however, indicate the parts of the software that are suspected of containing defects.

On the other hand, fault prediction is related to software quality management. In 2003 Khoshgoftaar et al. [15] observed that it can be particularly helpful in prioritizing quality assurance efforts. They studied high-assurance and mission-critical software systems heavily dependent on the reliability of software applications. They evaluated the predictive performance of six commonly used fault prediction techniques. Their case studies consisted of software metrics collected over large telecommunication system releases. During their tests it was observed that prediction models based on software metrics could actually predict the number of faults in software modules; additionally, they compared the performance of the assessed prediction models.

Static code attributes have been used for the identification of potentially problematic parts of a source code for a long time. In 1990 Porter et al. [16] addressed the issue of the early identification of high-risk components in the software life cycle. They proposed an approach that derived the models of problematic components based on their measurable attributes and the attributes of their development processes. The models allowed forecasting which components were likely to share the same high-risk properties, such as being error-prone or having a high development cost.


Table 1. Prior results of fault predictors using NASA data sets [17]

Data set   Language   Recall   False positive rate
PC1        C          0.24     0.25
JM1        C          0.25     0.18
CM1        C          0.35     0.10
KC2        C++        0.45     0.15
KC1        C++        0.50     0.15
In total:             0.36     0.17

Table 2. Prior results of fault predictors using NASA data sets [1] (logarithmic filter applied)

Data set   Language   Recall   False positive rate
PC1        C          0.48     0.17
MW1        C          0.52     0.15
KC3        Java       0.69     0.28
CM1        C          0.71     0.27
PC2        C          0.72     0.14
KC4        Java       0.79     0.32
PC3        C          0.80     0.35
PC4        C          0.98     0.29
In total:             0.71     0.25

In 2002, the NASA Metrics Data Program data sets were published [18]. Each data set contained complexity metrics defined by Halstead and McCabe, the lines of code metrics and defect rates for the modules of a different subsystem of NASA projects. These data sets included projects in C, C++ and Java. Multiple studies that followed used these data sets and significant progress in this area was made.

In 2003 Menzies et al. examined decision trees and rule-based learners [19–21]. They researched a situation in which it is impractical to rigorously assess all parts of complex systems and test engineers must use some kind of defect detectors to focus their limited resources. They defined the properties of good defect detectors and assessed different methods of their generation. They based their assessments on static code measures and found that (1) such defect detectors yield results that are stable across many applications, and (2) the detectors are inexpensive to use and can be tuned to the specifics of current business situations. They considered practical situations in which software costs are assessed and additionally assumed that better assessment allows earning exponentially more money. They pointed out that given finite budgets, assessment resources are typically skewed towards areas that are believed to be mission critical; hence, the portions of the system that may actually contain defects may be missed. They indicated that by using proper metrics and machine learning algorithms, quality indicators can be found early in the software development process.

In 2004 Menzies et al. [17] assessed other predictors of software defects and demonstrated that these predictors are outperformed by Naïve Bayes classifiers, reporting a mean recall of 0.36 with a false positive rate of 0.17 (see Table 1). More precisely, they demonstrated that when learning defect detectors from static code measures, Naïve Bayes learners are better than entropy-based decision-tree learners, and that accuracy is not a useful way to assess these detectors. They also argued that such learners need no more than 200–300 examples to learn adequate detectors, especially when the data has been heavily stratified, i.e. divided into sub-sub-subsystems.

In 2007 Menzies et al. [1] proposed applying a logarithmic filter to features. The value of using static code attributes to learn defect predictors was widely debated. Prior work explored issues such as the merits of McCabe's versus Halstead's versus lines of code counts for generating defect predictors. They showed that such debates are irrelevant, since how the attributes are used to build predictors is much more important than which particular attributes are actually used. They demonstrated that adding a logarithmic filter resulted in improving recall to 0.71, keeping the false positive rate reasonably low at 0.25 (see Table 2).

In 2012 Hall et al. [13] identified and analysed 208 defect prediction studies published from January 2000 to December 2010. By a systematic review, they drew the following conclusions: (1) there are multiple types of features that can be used for defect prediction, including static code metrics, change metrics and previous fault metrics; (2) there are no clear best bug-proneness indicators; (3) models reporting a categorical predicted variable (e.g. fault-prone or not fault-prone) are more prevalent than models reporting a continuous predicted variable; (4) various statistical and machine learning methods can be employed to build fault predictors; (5) industrial data can be reliably used, especially data publicly available in the NASA Metrics Data Program data sets; (6) fault predictors are usually developed for C/C++ and Java projects.

In 2016 Lanza et al. [22] criticized the evaluation methods of defect prediction approaches; they claimed that in order to achieve substantial progress in the field of defect prediction (and also other types of predictions), researchers should put predictors out into the real world and have them assessed by developers who work on a live code base, as defect prediction only makes sense if it is used in vivo.

The main purpose of this research is to extend the range of analysed programming languages to include Python. In the remaining part of the paper it is experimentally demonstrated that it is possible to predict defects for Python projects using static code features with an approach similar to (though not directly replicating) the one taken by Menzies et al. [1] for C/C++ and Java.

3. Problem definition

For the remaining part of this paper let fault denote any flaw in the source code that can cause the software to fail to perform its required function. Let repository denote the storage location from which the source code may be retrieved, with version control capabilities that allow analysing revisions, denoting the precisely specified incarnations of the source code at a given point in time. For a given revision K let K∼1 denote its parent revision, K∼2 denote its grandparent revision, etc. Let software metric denote the measure of a degree to which a unit of software possesses some property. Static metrics can be collected for software without executing it, in contrast to the dynamic ones. Let supervised learning denote a type of machine learning task where an algorithm learns from a set of training examples with assigned expected outputs [23].

The authors follow with the definition central to the problem researched in this paper.
Definition 3.1. Let a classification problem denote an instance of a machine learning problem where the expected output is categorical, that is where: a classifier is the algorithm that implements the classification; a training set is a set of instances supplied for the classifier to learn from; a testing set is a set of instances used for assessing classifier performance; an instance is a single object from which the classifier will learn or on which it will be used, usually represented by a feature vector, with features being individual measurable properties of the phenomenon being observed, and a class being the predicted variable, that is the output of the classifier for the given instance.

In short: in classification problems classifiers assign classes to instances based on their features.

Fault prediction is a process of predicting where faults are likely to occur in the source code. In this case machine learning algorithms operate on instances being units of code (e.g. functions, classes, packages). Instances are represented by their features, being the properties of the source code that indicate the source code unit's fault-proneness (e.g. number of lines of code, number of previous bugs, number of comments). The features are sometimes additionally preprocessed; an example of a feature preprocessor, called a logarithmic filter, substitutes the values of features with their logarithms. For the instances in the training set the predicted variable must be provided; e.g. the instances can be reviewed by experts and marked as fault-prone or not fault-prone. After the fault predictor learns from the training set of code units, it can be used to predict the fault-proneness of new units of code. The process is conceptually depicted in Figure 1.
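A minimal sketch of such a logarithmic filter is given below (an illustration only, assuming a small positive floor to keep the logarithm defined; the floor value is an assumption and is not taken from the paper):

import numpy as np

def log_filter(features, floor=1e-6):
    # Replace every feature value with its natural logarithm,
    # clipping values below the floor so that log() stays defined.
    features = np.asarray(features, dtype=float)
    return np.log(np.maximum(features, floor))

# Example: raw metric values of one function before and after filtering.
print(log_filter([0, 3, 120]))   # -> [-13.8155...,  1.0986...,  4.7874...]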

A confusion matrix is a matrix containing the counts of instances grouped by the actual and predicted class. For the classification problem it is a 2 × 2 matrix (as depicted in Table 3). The confusion matrix and derived metrics can be used to evaluate classifier performance, where the typical indicators are as follows:


Figure 1. Fault prediction problem (sample)

Table 3. Confusion matrix for classification problems

Actual \ Predicted   Negative                Positive
Negative             true negative (tn)      false positive (fp)
Positive             false negative (fn)     true positive (tp)

Definition 3.2. Let recall denote the fraction of actual positive class instances that are correctly assigned to the positive class:

recall = tp / (tp + fn)

Let precision denote the fraction of predicted positive class instances that actually are in the positive class:

precision = tp / (tp + fp)

Let false positive rate denote the fraction of actual negative class instances that are incorrectly assigned to the positive class:

false positive rate = fp / (fp + tn)

Let accuracy denote the fraction of instances assigned to correct classes:

accuracy = (tp + tn) / (tp + fp + tn + fn)
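For illustration, the four indicators of Definition 3.2 can be computed directly from the confusion matrix counts; the sketch below only restates the definitions (the function name and the example counts are hypothetical):

def confusion_metrics(tp, fp, tn, fn):
    # Indicators of Definition 3.2, computed from confusion matrix counts.
    recall = tp / (tp + fn)                      # actual positives correctly found
    precision = tp / (tp + fp)                   # predicted positives that are correct
    false_positive_rate = fp / (fp + tn)         # actual negatives incorrectly flagged
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # all instances labelled correctly
    return recall, precision, false_positive_rate, accuracy

# Example with hypothetical counts:
print(confusion_metrics(tp=32, fp=45, tn=655, fn=18))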

The remaining part of this section contains two subsections. In 3.1 the classification problem analysed in this study is stated in terms typical of machine learning, that is, instances: what kinds of objects are classified; classes: into what classes they are divided; features: what features are used to describe them; classifier: which learning method is used. Section 3.2 focuses on the practical aspects of fault prediction and describes the operational phases of the implementation: identification of instances, feature extraction, generation of a training set, training and predicting.

3.1. Classification problem definition

3.1.1. Instances

The defect predictor described in this study operates at the function level, which is a de facto standard in this field [13]. As the first rule of functions is that they should be small [24], it was assumed that it should be relatively easy for developers to find and fix a bug in a function reported as fault-prone by a function-level fault predictor. Hence, in this research functions were selected as the instances of the problem definition.

3.1.2. Classes

For simplicity of reasoning, in this research the severity of bugs is not predicted. Hence, the instances are labelled as either fault-prone or not fault-prone.

3.1.3. Features

To establish defect predictors, the code complexity measures defined by McCabe [25] and Halstead [26] were used.

The following Halstead complexity measures were applied in this study as code metrics for estimating programming effort. They estimate complexity using operator and operand counts and are widely used in fault prediction studies [1].
Definition 3.3. Let n1 denote the count of distinct operators, n2 the count of distinct operands, N1 the total count of operators and N2 the total count of operands. Then the Halstead metrics are defined as follows: program vocabulary n = n1 + n2; program length N = N1 + N2; calculated program length N̂ = n1 log2 n1 + n2 log2 n2; volume V = N × log2 n; difficulty D = (n1 / 2) × (N2 / n2); effort E = D × V; time required to program T = E / 18 seconds; number of delivered bugs B = V / 3000.
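As an illustration, the metrics of Definition 3.3 can be derived from the four counts in a few lines of code (a sketch of the definitions only; the function name and the sample counts are hypothetical):

import math

def halstead_metrics(n1, n2, N1, N2):
    # Halstead metrics from distinct/total operator and operand counts (Definition 3.3).
    n = n1 + n2                                       # program vocabulary
    N = N1 + N2                                       # program length
    N_hat = n1 * math.log2(n1) + n2 * math.log2(n2)   # calculated program length
    V = N * math.log2(n)                              # volume
    D = (n1 / 2) * (N2 / n2)                          # difficulty
    E = D * V                                         # effort
    T = E / 18                                        # time required to program (seconds)
    B = V / 3000                                      # number of delivered bugs
    return {"n": n, "N": N, "N_hat": N_hat, "V": V, "D": D, "E": E, "T": T, "B": B}

# Example with hypothetical counts for a single function:
print(halstead_metrics(n1=10, n2=7, N1=24, N2=19))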

In this research all the metrics defined above, including the counts of operators and operands, are used as features; in particular, preliminary research indicated that limiting the set of features leads to results with lower recall.

In the study, McCabe's cyclomatic complexity measure, being a quantitative measure of the number of linearly independent paths through a program's source code, was also applied. In terms of the software architecture graph, cyclomatic complexity is defined as follows.
Definition 3.4. Let G be the flow graph, being a subgraph of the software architecture graph, where e denotes the number of edges in G and n denotes the number of nodes in G. Then cyclomatic complexity CC is defined as CC(G) = e − n + 2.

It is worth noting that some researchers oppose using cyclomatic complexity for fault prediction. Fenton and Pfleeger argue that it is highly correlated with the lines of code, thus it carries little information [27]. However, other researchers used McCabe's complexity to build successful fault predictors [1]. Also, industry keeps recognizing the cyclomatic complexity measure as useful and uses it extensively, as it is straightforward and can be communicated across the different levels of development stakeholders [28]. In this research the latter opinions are followed.

3.1.4. Classifier

In this study, the authors opted for using a Naïve Bayes classifier. Naïve Bayes classifiers are a family of supervised learning algorithms based on applying Bayes' theorem with a naïve independence assumption between the features. In preliminary experiments, this classifier achieved significantly higher recall than the other classifiers that were considered. Also, as mentioned in section 2, it achieved the best results in previous fault prediction studies [1].

It should be noted that for a class variable y and features x1, . . . , xn, Bayes' theorem states the following relationship:

P(y | x1, . . . , xn) = P(y) P(x1, . . . , xn | y) / P(x1, . . . , xn).

This relationship can be simplified using the naïve independence assumption:

P(y | x1, . . . , xn) = P(y) (∏_{i=1}^{n} P(xi | y)) / P(x1, . . . , xn).

Since P(x1, . . . , xn) does not depend on y, the following classification rule can be used:

ŷ = arg max_y P(y) ∏_{i=1}^{n} P(xi | y),

where P(y) and P(xi | y) can be estimated using the training set. There are multiple variants of the Naïve Bayes classifier; in this paper a Gaussian Naïve Bayes classifier is used, which assumes that the likelihood of the features is Gaussian.

3.2. Classification problem implementation

3.2.1. Identification of instances

A fault predicting tool must be able to encode a project as a set of examples. The identification of instances is the first step of this process. This tool implements it as follows: (1) it retrieves the list of files in a project from a repository (Git); (2) it limits the results to source code (Python) files; (3) for each file it builds an Abstract Syntax Tree (AST) and walks the tree to find the nodes representing source code units (functions).
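A minimal sketch of step (3), using Python's standard ast module, is shown below (an illustration of the idea rather than the authors' implementation; the file path is hypothetical):

import ast

def find_functions(path):
    # Parse a Python file and return the names and starting lines of its functions.
    with open(path) as source_file:
        tree = ast.parse(source_file.read(), filename=path)
    return [(node.name, node.lineno)
            for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))]

# Example (hypothetical file path):
print(find_functions("flask/app.py"))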

3.2.2. Feature extraction

A fault predictor expects instances to be represented by vectors of features. This tool extracts those features in the following way. Halstead metrics are derived from the counts of operators and operands. To calculate them for a given instance, this tool performs the following steps: (1) it extracts the line range of a function from the AST; (2) it uses a lexical scanner to tokenize the function's source; (3) for each token it decides whether the token is an operator, an operand, or neither. First, the token type is used to decide whether it is an operator or an operand, see Table 4.

If the token type alone is not enough to distinguish between an operator and an operand (which is the case for tokenize.NAME tokens), then tokens that are Python keywords are considered operators, and all remaining tokens are considered operands. McCabe's complexity for functions is calculated directly from the AST. Table 5 presents the effects of Python statements on the cyclomatic complexity score.
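A sketch of this token classification, based on the standard tokenize and keyword modules and the type lists of Table 4, is given below (illustrative only; the function name and the sample source are hypothetical):

import io
import keyword
import tokenize

OPERATOR_TYPES = [tokenize.OP, tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT]
OPERAND_TYPES = [tokenize.STRING, tokenize.NUMBER]

def count_operators_operands(source):
    # Classify the tokens of a function's source as operators or operands (cf. Table 4).
    operators, operands = [], []
    for token in tokenize.generate_tokens(io.StringIO(source).readline):
        if token.type in OPERATOR_TYPES:
            operators.append(token.string)
        elif token.type in OPERAND_TYPES:
            operands.append(token.string)
        elif token.type == tokenize.NAME:
            # Keywords count as operators, all remaining names as operands.
            (operators if keyword.iskeyword(token.string) else operands).append(token.string)
    # Distinct and total counts feed the Halstead metrics (n1, n2, N1, N2).
    return len(set(operators)), len(set(operands)), len(operators), len(operands)

print(count_operators_operands("def f(x):\n    return x + 1\n"))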

3.2.3. Training set generation

Creating a fault predicting tool applicable to many projects can be achieved either by training a universal model, or by training predictors individually for each project [29]. This research adopts the latter approach: for each project it generates a training set using data extracted from the given project repository. Instances in the training set have to be assigned to classes; in this case software functions have to be labelled as either fault-prone or not fault-prone. In previous studies, such labels were typically assigned by human experts, which is a tedious and expensive process. In order to avoid this step, this tool relies on the following general definition of fault-proneness:
Definition 3.5. For a given revision, a function is fault-prone if it was fixed in one of the K next commits, where the choice of K should depend on the frequency of commits.

The definition of fault-proneness can be extended due to the fact that relying on a project architecture warehouse enables mining information in commit logs. For the identification of commits as bug-fixing, a simple heuristic frequently used in previous studies was followed in this research [30, 31].
Definition 3.6. A commit is bug-fixing if its log contains any of the following words: bug, fix, issue.
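The heuristic of Definition 3.6 amounts to a simple keyword test on commit messages, sketched below (an illustration only; the commit messages are hypothetical):

BUG_FIX_KEYWORDS = ("bug", "fix", "issue")

def is_bug_fixing(commit_message):
    # A commit is treated as bug-fixing if its log mentions any keyword (Definition 3.6).
    message = commit_message.lower()
    return any(word in message for word in BUG_FIX_KEYWORDS)

# Examples with hypothetical commit messages:
print(is_bug_fixing("Fix crash when config file is missing"))   # True
print(is_bug_fixing("Add streaming support to responses"))      # False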

Obviously, such a method of generating training data is based on the assumption that bug-fixing commits are properly marked and contain only fixes, which is consistent with the best practices for Git [32]. It is worth noting that since this might not be the case for all projects, the tool in its current form is not recommended for predicting faults in projects that do not follow these practices.

3.2.4. Training and predicting

Training a classifier and making predictions for new instances are the key parts of a fault predictor. For these phases, the tool relies on GaussianNB, the Gaussian Naïve Bayes implementation from Scikit-learn (scikit-learn.org).
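A minimal sketch of these two phases with Scikit-learn is shown below (illustrative only; the feature values, labels and the use of the logarithmic filter are assumptions about the pipeline, not the authors' exact code):

import numpy as np
from sklearn.naive_bayes import GaussianNB

def log_filter(features, floor=1e-6):
    return np.log(np.maximum(np.asarray(features, dtype=float), floor))

# Hypothetical data: one row of static code metrics per function,
# one label per function (1 = fault-prone, 0 = not fault-prone).
X_train = log_filter([[12, 80, 3], [4, 25, 1], [30, 200, 9], [6, 40, 2]])
y_train = np.array([1, 0, 1, 0])

classifier = GaussianNB()
classifier.fit(X_train, y_train)            # training phase

X_new = log_filter([[20, 150, 5]])
print(classifier.predict(X_new))            # prediction phase: 0 or 1 per new function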

4. Main result

The tool's performance was experimentally assessed on five arbitrarily selected open-source projects of different characteristics.


Table 4. Operator and operand types

OPERATOR_TYPES = [ tokenize.OP, tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT]

OPERAND_TYPES = [ tokenize.STRING, tokenize.NUMBER ]

Table 5. Contribution of Python constructs to cyclomatic complexity

Construct        Effect   Reasoning
if               +1       An if statement is a single decision
elif             +1       The elif statement adds another decision
else             0        Does not cause a new decision - the decision is at the if
for              +1       There is a decision at the start of the loop
while            +1       There is a decision at the while statement
except           +1       Each except branch adds a new conditional path of execution
finally          0        The finally block is unconditionally executed
with             +1       The with statement roughly corresponds to a try/except block
assert           +1       The assert statement internally roughly equals a conditional statement
comprehension    +1       A list/set/dict comprehension or generator expression is equivalent to a for loop
lambda           +1       A lambda function is a regular function
boolean          +1       Every boolean operator (and, or) adds a decision point
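For illustration, a simplified decision point counter over the AST, following the contributions listed in Table 5, could look as follows (a rough sketch covering only a subset of the constructs; it is not the counting code actually used by the tool):

import ast

# Node types that add one decision point each (a subset of Table 5).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.With,
                  ast.Assert, ast.ListComp, ast.SetComp, ast.DictComp,
                  ast.GeneratorExp, ast.Lambda, ast.BoolOp)

def cyclomatic_complexity(function_node):
    # Approximate McCabe complexity: 1 plus the number of decision points.
    complexity = 1
    for node in ast.walk(function_node):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        if isinstance(node, ast.BoolOp):
            # 'a and b and c' contains two boolean operators, not one.
            complexity += len(node.values) - 2
    return complexity

source = "def f(x):\n    if x > 0 and x < 10:\n        return x\n    return 0\n"
print(cyclomatic_complexity(ast.parse(source).body[0]))   # 1 + 1 (if) + 1 (and) = 3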

Table 6. Projects used for evaluation

Project      Location at github.com
Flask        /mitsuhiko/flask
Odoo         /odoo/odoo
GitPython    /gitpython-developers/GitPython
Ansible      /ansible/ansible
Grab         /lorien/grab

Table 7. Summary of projects used for evaluation: projects' revisions (Rv) with the corresponding number of commits (Cm), branches (Br), releases (Rl) and contributors (Cn)

Project      Rv        Cm      Br   Rl   Cn
Flask        7f38674   2319    16   16   277
Odoo         898cae5   94106   12   79   379
GitPython    7f8d9ca   1258    7    20   67
Ansible      718812d   15935   34   76   1154
Grab         e6477fa   1569    2    0    32

The projects were: Flask – a web development micro-framework; Odoo – a collection of business apps; GitPython – a library to interact with Git repositories; Ansible – an IT automation system; Grab – a web scraping framework. The analyzed software varies in scope and complexity: from a library with a narrow scope, through frameworks, to a powerful IT automation platform and a fully-featured ERP system. All projects are publicly available on GitHub (see Table 6) and are under active development.

Data sets for evaluation were generated from the projects using the method described in section 3, namely: features were calculated for revision HEAD∼100, where HEAD is the revision specified in Table 7; functions were labeled as fault-prone if they were modified in a bug-fixing commit between revisions HEAD∼100 and HEAD; the data set was truncated to files modified in any commit between revisions HEAD∼100 and HEAD. Table 8 presents the total count and incidence of fault-prone functions for each data set.

As defined in section 3, recall and false positive rates were used to assess the performance of fault predictors. In terms of these metrics, a good fault predictor should achieve: high recall – a fault predictor should identify as many faults in the project as possible; if two predictors obtain the same false positive rate, the one with the higher recall is preferred, as it will yield more fault-prone functions; low false positive rate – code units identified as bug-prone require developer action;


Table 8. Data sets used for evaluation

Project      Functions   Fault-prone   % fault-prone
Flask        786         30            3.8
Odoo         1192        50            4.2
GitPython    548         63            11.5
Ansible      752         69            9.2
Grab         417         31            7.4

Table 9. Results for the best predictor

Project      Recall (mean)   Recall (SD)   False positive rate (mean)   False positive rate (SD)
Flask        0.617           0.022         0.336                        0.005
Odoo         0.640           < 0.001       0.234                        0.003
GitPython    0.467           0.019         0.226                        0.003
Ansible      0.522           < 0.001       0.191                        0.002
Grab         0.416           0.010         0.175                        0.004
In total:    0.531           < 0.03        0.240                        < 0.03

the predictor with fewer false alarms requires less human effort, as it returns fewer functions that are actually not fault-prone.

It is worth noting that Zhang and Zhang [33] argue that a good prediction model should actually achieve both high recall and high precision. However, Menzies et al. [34] advise against using precision for assessing fault predictors, as it is less stable across different data sets than the false positive rate. This study follows this advice.

For this research a stratified 10-fold cross-validation was used as the base method for evaluating predicting performance. K-fold cross-validation divides instances from the training set into K equal-sized buckets, and each bucket is then used as a test set for a classifier trained on the remaining K − 1 buckets. This method ensures that the classifier is not evaluated on instances it used for learning and that all instances are used for validation.

As bug-prone functions were rare in the training sets, the folds were stratified, i.e. each fold contained roughly the same proportion of samples for each label.
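A sketch of this evaluation scheme with Scikit-learn is given below (illustrative only: the feature matrix and labels are randomly generated, and the repetitions with shuffled instance order described in the next paragraph are approximated here with RepeatedStratifiedKFold):

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB

# Hypothetical data set: rows of (log-filtered) metrics, roughly 10% fault-prone labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = (rng.random(400) < 0.1).astype(int)

# Stratified 10-fold cross-validation, repeated 10 times with shuffled order.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_validate(GaussianNB(), X, y, cv=cv, scoring=["recall"])
print(scores["test_recall"].mean())   # the false positive rate would need a custom scorer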

This procedure was additionally repeated 10 times, each time randomizing the order of examples. This step was added to check whether predicting performance depends on the order of the training set. A similar process was used by other researchers (e.g. [1, 35]).
Main result 1. The fault predictor presented in this research achieved recall up to 0.64 with a false positive rate of 0.23 (mean recall 0.53 with a false positive rate of 0.24, see Table 9 for details).

It is worth noting that: the highest recall was achieved for project Odoo: 0.640; the lowest recall was achieved for project Grab: 0.416; the lowest false positive rate was achieved for project Grab: 0.175; the highest false positive rate was achieved for project Flask: 0.336. For all data sets recall was significantly higher than the false positive rate. The results were stable over consecutive runs; the standard deviation did not exceed 0.03, either for recall or for the false positive rate.
Main result 2. This research additionally supports the significance of applying the logarithmic filter, since the fault predictor implemented for this research without using this filter achieved a significantly lower mean recall of 0.328 with a false positive rate of 0.108 (see Table 10 for details).

Table 10. Results for the best predictor without the logarithmic filter

Project      Recall (mean)   Recall (SD)   False positive rate (mean)   False positive rate (SD)
Flask        0.290           0.037         0.119                        0.004
Odoo         0.426           0.009         0.132                        < 0.001
GitPython    0.273           0.016         0.129                        0.006
Ansible      0.371           0.012         0.068                        < 0.001
Grab         0.219           0.028         0.064                        0.005
In total:    0.328                         0.108

It should be emphasised that a similar significance was indicated in the case of the detectors for C/C++ and Java projects in [1].

5. Conclusions

In this study, machine learning methods were applied to the software engineering problem of fault prediction. Fault predictors can be useful for directing quality assurance efforts. Prior studies showed that static code features can be used for building practical fault predictors for C/C++ and Java projects. This research demonstrates that these techniques also work for Python, a popular programming language that was omitted in previous research. The tool resulting from this research is a function-level fault prediction tool for Python projects. Its performance was experimentally assessed on five open-source projects. On selected projects the tool achieved recall up to 0.64 with a false positive rate of 0.23, and a mean recall of 0.53 with a false positive rate of 0.24. Leading fault predictors trained on the NASA data sets achieved a higher mean recall of 0.71 with a similar false positive rate of 0.25 [1]. Labour-intensive, manual code inspections can find about 60% of defects [36]. This research comes close to reaching a similar level of recall. The performance of this tool can be perceived as satisfactory, supporting the hypothesis that predicting faults for Python programs has a similar potential to that of C/C++ and Java programs, and that more thorough future research in this area is worth conducting.

5.1. Threats to validity

Internal. There are no significant threats to internal validity. The goal was to take an approach inspired by the experiments conducted by Menzies et al. [1]. The experimental results for Python proved to be consistent with the ones reported for C/C++ and Java, namely that: static code features are useful for the identification of faults; fault predictors using the Naïve Bayes classifier perform well; however, using a logarithmic filter is encouraged, as it improves predicting performance. Using other methods of extracting features for machine learning (i.e. Python features which are absent in C/C++ or Java) could potentially lead to a better performance of the tool.

External. There are threats to external validity. The results obtained in this research are not valid for generalization from the context in which this experiment was conducted to a wider context. More precisely, the range of five arbitrarily selected software projects provides experimental evidence that this direction of research is worth pursuing; however, by itself it does not provide enough evidence for general conclusions and more thorough future research is required. Also, the tool performance was assessed only in terms of recall and false positive rates; it has not been actually verified in practice. It is thus possible that the tool's current predicting ability might prove not good enough for practical purposes and its further development will be required. Therefore, the conclusion of the universal practical applicability of such an approach cannot be drawn yet.

Construct. There are no significant threats to construct validity. In this approach the authors were not interested in deciding whether it was the selected machine learning technique, the project attributes used for learning or the completeness of the fault-proneness definition for the training set that contributed most to the tool performance. The important conclusion was that the results obtained do not exclude but support the hypothesis that automated fault prediction in Python allows obtaining accuracy comparable to the results obtained for other languages and to human-performed fault prediction, hence they encourage more research in this area. Thus, the results provided in this paper serve as an example and a rough estimation of the predicting performance expected nowadays from fault predictors using static code features. There are a few additional construct conditions worth mentioning. As discussed in section 3, the tool's training set generation method relies on project change logs being part of the project architecture warehouse. If bug-fixing commits are not properly labelled, or contain not only fixes, then the generated data sets might be skewed. Clearly, the performance of the tool can be further improved, as it is not yet as good as the performance of fault predictors for C/C++ and Java; the current result is a good start for this improvement. Comparing the performance of classifiers using different data sets is not recommended, as predictors performing well on one set of data might fail on another.

Conclusion. There are no significant threats to conclusion validity. Fault recall (detection rate) alone is not enough to properly assess the performance of a fault predictor (i.e. a trivial fault predictor that labels all functions as fault-prone achieves total recall), hence the focus on both recall (detection) and false positives (false alarms). Obviously, the false positive rate of a fault predictor should be lower than its recall, as a predictor randomly labelling a fraction p of functions as fault-prone on average achieves a recall and false positive rate of p. This has been achieved in this study, similarly to [1]. From the practical perspective, the goal in this research was to recognize automatically as many relevant (erroneous) functions as possible, which should later be revised manually by programmers; that is, the authors were interested in achieving high recall and trading precision for recall if needed. From the perspective of this research's goals, evaluating classifiers by measures other than those used in [1] (i.e. using other elements of the confusion matrix) was not directly relevant for the conclusions presented in this paper.

5.2. Future research

Additional features. As mentioned in section 3, static code metrics are only a subset of the features that can be used for training fault predictors. In particular, methods utilizing previous defect data, such as [37], can also be useful for focusing code inspection efforts [38, 39]. Change data, such as code churn or fine-grained code changes, were also reported to be significant bug indicators [40–42]. Adding support for these features might augment the tool's fault predicting capabilities. Moreover, further static code features, such as the object-oriented metrics defined by Chidamber and Kemerer [43], can be used for bug prediction [32, 44]. With more attributes, adding a feature selection step to the tool might also be beneficial. Feature selection can also improve training times, simplify the model and reduce overfitting.

Additional algorithms. The tool uses a Naïve Bayes classifier for predicting software defects. In preliminary experiments different learning algorithms were assessed, but they performed significantly worse. It is possible that with more features supplied and fine-tuned parameters these algorithms could eventually outperform the Naïve Bayes classifier. Prediction efficiency could also be improved by including some strategies for eliminating class imbalance [45] in the data sets. Researchers also keep proposing more sophisticated methods for identifying bug-fixing commits than the simple heuristic used in this research; in particular, high-recall automatic algorithms for recovering links between bugs and commits have been developed. Integrating algorithms such as [46] into the training set generation process could improve the quality of the data and, presumably, the tool's predicting performance.

Additional projects. In the preliminary experiments, a very limited number of Python projects were used for training and testing. Extending the set of Python projects contributing to the training and testing sets is needed to generalize the conclusions. The selection of additional projects should be conducted in a systematic manner. A live code base could be used for predictor evaluation [22], which means introducing predictors into the development toolsets used by software developers in live software projects. The next research steps should involve a more in-depth discussion of the findings on the Python projects, in particular identifying why the proposed techniques perform better in some projects than in others.

References

[1] T. Menzies, J. Greenwald, and A. Frank, “Datamining static code attributes to learn defect pre-dictors,” IEEE Transactions on Software Engi-neering, Vol. 33, No. 1, 2007, pp. 2–13.

[2] E.W. Dijkstra, “Letters to the editor: go to state-ment considered harmful,” Communications ofthe ACM, Vol. 11, No. 3, 1968, pp. 147–148.

[3] J. McCarthy, P.W. Abrams, D.J. Edwards, T.P.Hart, and M.I. Levin, Lisp 1.5 programmer’smanual. The MIT Press, 1962.

[4] W. Royce, “Managing the development of largesoftware systems: Concepts and techniques,” inTechnical Papers of Western Electronic Show andConvention (WesCon), 1970, pp. 328–338.

[5] K. Beck, “Embracing change with extreme pro-gramming,” IEEE Computer, Vol. 32, No. 10,1999, pp. 70–77.

[6] R. Kaufmann and D. Janzen, “Implicationsof test-driven development: A pilot study,” inCompanion of the 18th annual ACM SIGPLANConference on Object-oriented Programming,

Page 13: Applying Machine Learning to Software Fault …...For fault prediction problem, false positive rate is a fraction of defect-free code units that are incorrectly labeled as fault-prone.

Applying Machine Learning to Software Fault Prediction 211

Systems, Languages, and Applications, ser.OOPSLA ’03. New York, NY, USA: ACM, 2003,pp. 298–299.

[7] R. Dąbrowski, “On architecture warehouses andsoftware intelligence,” in Future Generation In-formation Technology, ser. Lecture Notes inComputer Science, T.H. Kim, Y.H. Lee, andW.C. Fang, Eds., Vol. 7709. Springer, 2012, pp.251–262.

[8] R. Dąbrowski, K. Stencel, and G. Timoszuk,“Software is a directed multigraph,” in 5thEuropean Conference on Software ArchitectureECSA, ser. Lecture Notes in Computer Science,I. Crnkovic, V. Gruhn, and M. Book, Eds.,Vol. 6903. Essen, Germany: Springer, 2011, pp.360–369.

[9] R. Dąbrowski, G. Timoszuk, and K. Stencel,“One graph to rule them all (software measure-ment and management),” Fundamenta Informat-icae, Vol. 128, No. 1-2, 2013, pp. 47–63.

[10] G. Tassey, “The economic impacts of in-adequate infrastructure for software test-ing,” National Institute of Standardsand Technology, Tech. Rep., 2002. [On-line]. https://pdfs.semanticscholar.org/9b68/5f84da00514397d9af7f27cc0b7db7df05c3.pdf

[11] T. Menzies, Z. Milton, B. Turhan, B. Cukic,Y. Jiang, and A.B. Bener, “Defect predictionfrom static code features: Current results, limi-tations, new approaches,” Automated SoftwareEngineering, Vol. 17, No. 4, 2010, pp. 375–407.

[12] S. Lessmann, B. Baesens, C. Mues, andS. Pietsch, “Benchmarking classification modelsfor software defect prediction: A proposed frame-work and novel findings,” IEEE Transactions onSoftware Engineering, Vol. 34, No. 4, 2008, pp.485–496.

[13] T. Hall, S. Beecham, D. Bowes, D. Gray, andS. Counsell, “A systematic literature review onfault prediction performance in software engi-neering,” IEEE Transactions on Software Engi-neering, Vol. 38, No. 6, 2012, pp. 1276–1304.

[14] B.W. Boehm, “Software risk management,” inESEC ’89, 2nd European Software EngineeringConference, ser. Lecture Notes in Computer Sci-ence, C. Ghezzi and J.A. McDermid, Eds., Vol.387. Springer, 1989, pp. 1–19.

[15] T.M. Khoshgoftaar and N. Seliya, “Fault pre-diction modeling for software quality estimation:Comparing commonly used techniques,” Empir-ical Software Engineering, Vol. 8, No. 3, 2003,pp. 255–283.

[16] A.A. Porter and R.W. Selby, “Empirically guidedsoftware development using metric-based clas-

sification trees,” IEEE Software, Vol. 7, No. 2,1990, pp. 46–54.

[17] T. Menzies, J. DiStefano, A. Orrego, and R.M.Chapman, “Assessing predictors of software de-fects,” in Proceedings of Workshop PredictiveSoftware Models, 2004.

[18] T. Menzies, R. Krishna, and D. Pryor, ThePromise Repository of Empirical SoftwareEngineering Data, North Carolina State Univer-sity, Department of Computer Science, (2015).[Online]. http://openscience.us/repo

[19] T. Menzies, J.S.D. Stefano, K. Ammar,K. McGill, P. Callis, R.M. Chapman, andJ. Davis, “When can we test less?” in 9thIEEE International Software Metrics Symposium(METRICS), Sydney, Australia, 2003, p. 98.

[20] T. Menzies, J.S.D. Stefano, and M. Chapman,“Learning early lifecycle IV&V quality indica-tors,” in 9th IEEE International Software Met-rics Symposium (METRICS), Sydney, Australia,2003, pp. 88–97.

[21] T. Menzies and J.S.D. Stefano, “How good isyour blind spot sampling policy?” in 8th IEEEInternational Symposium on High-AssuranceSystems Engineering (HASE). Tampa, FL, USA:IEEE Computer Society, 2004, pp. 129–138.

[22] M. Lanza, A. Mocci, and L. Ponzanelli, “Thetragedy of defect prediction, prince of empiricalsoftware engineering research,” IEEE Software,Vol. 33, No. 6, 2016, pp. 102–105.

[23] M. Mohri, A. Rostamizadeh, and A. Talwalkar,Foundations of Machine Learning, ser. Adaptivecomputation and machine learning. The MITPress, 2012.

[24] R.C. Martin, Clean Code: A Handbook of AgileSoftware Craftsmanship. Prentice Hall, 2008.

[25] T.J. McCabe, “A complexity measure,” IEEETransactions on Software Engineering, Vol. 2,No. 4, 1976, pp. 308–320.

[26] M.H. Halstead, Elements of Software Science(Operating and Programming Systems Series).New York, NY, USA: Elsevier Science Ltd, 1977.

[27] N.E. Fenton and S.L. Pfleeger, Software Met-rics: A Rigorous and Practical Approach, 2nd ed.Boston, MA, USA: Course Technology, 1998.

[28] C. Ebert and J. Cain, “Cyclomatic complexity,”IEEE Software, Vol. 33, No. 6, 2016, pp. 27–29.

[29] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou,“Towards building a universal defect predictionmodel,” in 11th Working Conference on Min-ing Software Repositories, MSR, P.T. Devanbu,S. Kim, and M. Pinzger, Eds. Hyderabad, India:ACM, 2014, pp. 182–191.

Page 14: Applying Machine Learning to Software Fault …...For fault prediction problem, false positive rate is a fraction of defect-free code units that are incorrectly labeled as fault-prone.

212 Bartłomiej Wójcicki, Robert Dąbrowski

[30] S. Kim, “Adaptive bug prediction by analyzing project history,” Ph.D. dissertation, University of California at Santa Cruz, Santa Cruz, CA, USA, 2006, AAI3229992.

[31] S. Kim, T. Zimmermann, K. Pan, and E.J. Whitehead, Jr., “Automatic identification of bug-introducing changes,” in 21st IEEE/ACM International Conference on Automated Software Engineering (ASE). Tokyo, Japan: IEEE Computer Society, 2006, pp. 81–90.

[32] T. Gyimóthy, R. Ferenc, and I. Siket, “Empirical validation of object-oriented metrics on open source software for fault prediction,” IEEE Transactions on Software Engineering, Vol. 31, No. 10, 2005, pp. 897–910.

[33] H. Zhang and X. Zhang, “Comments on ‘Data mining static code attributes to learn defect predictors’,” IEEE Transactions on Software Engineering, Vol. 33, No. 9, 2007, pp. 635–637.

[34] T. Menzies, A. Dekhtyar, J.S.D. Stefano, and J. Greenwald, “Problems with precision: A response to ‘Comments on data mining static code attributes to learn defect predictors’,” IEEE Transactions on Software Engineering, Vol. 33, No. 9, 2007, pp. 637–640.

[35] M.A. Hall and G. Holmes, “Benchmarking attribute selection techniques for discrete class data mining,” IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 6, 2003, pp. 1437–1447.

[36] F. Shull, V.R. Basili, B.W. Boehm, A.W. Brown, P. Costa, M. Lindvall, D. Port, I. Rus, R. Tesoriero, and M.V. Zelkowitz, “What we have learned about fighting defects,” in 8th IEEE International Software Metrics Symposium (METRICS). Ottawa, Canada: IEEE Computer Society, 2002, p. 249.

[37] S. Kim, T. Zimmermann, E.J. Whitehead, Jr., and A. Zeller, “Predicting faults from cached history,” in 29th International Conference on Software Engineering (ICSE 2007). Minneapolis, MN, USA: IEEE, 2007, pp. 489–498.

[38] F. Rahman, D. Posnett, A. Hindle, E.T. Barr, and P.T. Devanbu, “BugCache for inspections: hit or miss?” in SIGSOFT/FSE’11: 19th ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE-19) and ESEC’11: 13th European Software Engineering Conference (ESEC-13), T. Gyimóthy and A. Zeller, Eds. Szeged, Hungary: ACM, 2011, pp. 322–331.

[39] C. Lewis, Z. Lin, C. Sadowski, X. Zhu, R. Ou, and E.J. Whitehead, Jr., “Does bug prediction support human developers? Findings from a Google case study,” in 35th International Conference on Software Engineering, ICSE, D. Notkin, B.H.C. Cheng, and K. Pohl, Eds. San Francisco, CA, USA: IEEE/ACM, 2013, pp. 372–381.

[40] N. Nagappan and T. Ball, “Use of relative code churn measures to predict system defect density,” in 27th International Conference on Software Engineering (ICSE), G. Roman, W.G. Griswold, and B. Nuseibeh, Eds. ACM, 2005, pp. 284–292.

[41] A.E. Hassan, “Predicting faults using the complexity of code changes,” in 31st International Conference on Software Engineering, ICSE. Vancouver, Canada: IEEE, 2009, pp. 78–88.

[42] E. Giger, M. Pinzger, and H.C. Gall, “Comparing fine-grained source code changes and code churn for bug prediction,” in Proceedings of the 8th International Working Conference on Mining Software Repositories, MSR, A. van Deursen, T. Xie, and T. Zimmermann, Eds. ACM, 2011, pp. 83–92.

[43] S.R. Chidamber and C.F. Kemerer, “Towards a metrics suite for object oriented design,” in Sixth Annual Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA ’91), A. Paepcke, Ed. Phoenix, Arizona, USA: ACM, 1991, pp. 197–211.

[44] R. Shatnawi, “Empirical study of fault prediction for open-source systems using the Chidamber and Kemerer metrics,” IET Software, Vol. 8, No. 3, 2014, pp. 113–119.

[45] N.V. Chawla, N. Japkowicz, and A. Kotcz, “Editorial: Special issue on learning from imbalanced data sets,” ACM SIGKDD Explorations Newsletter, Vol. 6, No. 1, Jun. 2004, pp. 1–6.

[46] R. Wu, H. Zhang, S. Kim, and S. Cheung, “ReLink: Recovering links between bugs and changes,” in SIGSOFT/FSE’11: 19th ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE-19) and ESEC’11: 13th European Software Engineering Conference (ESEC-13), T. Gyimóthy and A. Zeller, Eds. Szeged, Hungary: ACM, 2011, pp. 15–25.

[47] T. Gyimóthy and A. Zeller, Eds., SIGSOFT/FSE’11: Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering. Szeged, Hungary: ACM, 2011.


Appendix

The implementation of the method used in this study for predictor evaluation is outlined below; it can be used to reproduce the results of the experiments.
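For reference, the two scores reported by the counters in the listing (RecallCounter and FalsePositiveRateCounter) are computed from the confusion matrix in the standard way, with true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) counted over code units:

\[ \mathit{recall} = \frac{TP}{TP + FN}, \qquad \mathit{false\ positive\ rate} = \frac{FP}{FP + TN}. \]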

# imports available on github.com
import git
import numpy as np
from sklearn import cross_validation  # scikit-learn < 0.20 API; newer releases provide cross_val_predict in sklearn.model_selection
from sklearn import metrics
from sklearn import naive_bayes
from sklearn import utils
from scary import dataset
from scary import evaluation


def run():
    projects = [
        "path/to/flask",
        "path/to/odoo",
        "path/to/GitPython",
        "path/to/ansible",
        "path/to/grab",
    ]
    classifier = naive_bayes.GaussianNB()
    EvaluationRunner(projects, classifier).evaluate()


class EvaluationRunner:
    # Builds a training set from each project's repository history and evaluates
    # the classifier with repeated, shuffled k-fold cross-validation.
    def __init__(self, projects, classifier, from_revision="HEAD~100", to_revision="HEAD",
                 shuffle_times=10, folds=10):
        self.projects = projects
        self.classifier = classifier
        self.from_revision = from_revision
        self.to_revision = to_revision
        self.shuffle_times = shuffle_times
        self.folds = folds

    def evaluate(self):
        total_score_manager = self.total_score_manager()
        for project in self.projects:
            project_score_manager = self.project_score_manager()
            training_set = self.build_training_set(project)
            for data, target in self.shuffled_training_sets(training_set):
                predictions = self.cross_predict(data, target)
                confusion_matrix = self.confusion_matrix(predictions, target)
                total_score_manager.update(confusion_matrix)
                project_score_manager.update(confusion_matrix)
            self.report_score(project, project_score_manager)
        self.report_score("TOTAL", total_score_manager)

    def project_score_manager(self):
        return ScoreManager.project_score_manager()

    def total_score_manager(self):
        return ScoreManager.total_score_manager()

    def build_training_set(self, project):
        repository = git.Repo(project)
        return dataset.TrainingSetBuilder.build_training_set(repository,
                                                             self.from_revision, self.to_revision)

    def shuffled_training_sets(self, training_set):
        for _ in range(self.shuffle_times):
            yield utils.shuffle(training_set.features, training_set.classes)

    def cross_predict(self, data, target):
        return cross_validation.cross_val_predict(self.classifier, data, target,
                                                  cv=self.folds)

    def confusion_matrix(self, predictions, target):
        confusion_matrix = metrics.confusion_matrix(target, predictions)
        return evaluation.ConfusionMatrix(confusion_matrix)

    def report_score(self, description, score_manager):
        print(description)
        score_manager.report()


class ScoreManager:
    # Feeds every confusion matrix to a set of score counters and prints their results.
    def __init__(self, counters):
        self.counters = counters

    def update(self, confusion_matrix):
        for counter in self.counters:
            counter.update(confusion_matrix)

    def report(self):
        for counter in self.counters:
            print(counter.description, counter.score)

    @classmethod
    def project_score_manager(cls):
        counters = [MeanScoreCounter(RecallCounter),
                    MeanScoreCounter(FalsePositiveRateCounter)]
        return cls(counters)

    @classmethod
    def total_score_manager(cls):
        counters = [RecallCounter(),
                    FalsePositiveRateCounter()]
        return cls(counters)


class BaseScoreCounter:
    def update(self, confusion_matrix):
        raise NotImplementedError

    @property
    def score(self):
        raise NotImplementedError

    @property
    def description(self):
        raise NotImplementedError


class MeanScoreCounter(BaseScoreCounter):
    # Reports the mean and standard deviation of a per-run score.
    def __init__(self, partial_counter_class):
        self.partial_counter_class = partial_counter_class
        self.partial_scores = []

    def update(self, confusion_matrix):
        partial_score = self.partial_score(confusion_matrix)
        self.partial_scores.append(partial_score)

    def partial_score(self, confusion_matrix):
        partial_counter = self.partial_counter_class()
        partial_counter.update(confusion_matrix)
        return partial_counter.score

    @property
    def score(self):
        return np.mean(self.partial_scores), np.std(self.partial_scores)

    @property
    def description(self):
        return "mean {}".format(self.partial_counter_class().description)


class RecallCounter(BaseScoreCounter):
    def __init__(self):
        self.true_positives = 0
        self.false_negatives = 0

    def update(self, confusion_matrix):
        self.true_positives += confusion_matrix.true_positives
        self.false_negatives += confusion_matrix.false_negatives

    @property
    def score(self):
        return self.true_positives / (self.true_positives + self.false_negatives)

    @property
    def description(self):
        return "recall"


class FalsePositiveRateCounter(BaseScoreCounter):
    def __init__(self):
        self.false_positives = 0
        self.true_negatives = 0

    def update(self, confusion_matrix):
        self.false_positives += confusion_matrix.false_positives
        self.true_negatives += confusion_matrix.true_negatives

    @property
    def score(self):
        return self.false_positives / (self.false_positives + self.true_negatives)

    @property
    def description(self):
        return "false positive rate"


if __name__ == "__main__":
    run()
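The listing relies on two helpers from the authors' tooling that are not reproduced here: dataset.TrainingSetBuilder and evaluation.ConfusionMatrix. As an illustration only, a minimal ConfusionMatrix wrapper compatible with the counters above could look as follows; this is a hypothetical sketch, not the authors' implementation, and it assumes binary labels in which 1 marks a fault-prone code unit and 0 a defect-free one, so that scikit-learn's confusion_matrix returns the layout [[TN, FP], [FN, TP]]:

# Hypothetical sketch of the ConfusionMatrix wrapper assumed by the counters above.
class ConfusionMatrix:
    def __init__(self, matrix):
        # matrix is the 2x2 array returned by sklearn.metrics.confusion_matrix
        # for binary labels (0 = defect-free, 1 = fault-prone):
        # [[TN, FP],
        #  [FN, TP]]
        self.true_negatives = matrix[0][0]
        self.false_positives = matrix[0][1]
        self.false_negatives = matrix[1][0]
        self.true_positives = matrix[1][1]

With such a wrapper, RecallCounter accumulates true positives and false negatives across all folds and shuffles, while FalsePositiveRateCounter accumulates false positives and true negatives, yielding the recall and false positive rate values reported for the experiments.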