[IEEE 2009 10th Latin American Test Workshop - Rio de Janeiro, Brazil (2009.03.2-2009.03.5)] 2009...

Exploring Machine Learning Techniques for Fault Localization

Luciano C. Ascari Lucilia Y. Araki Aurora R.T. Pozo Silvia R. Vergilio

Federal University of Parana (UFPR)

Computer Science Department, CP: 19081, CEP 19031-970

Centro Politecnico, Jardim das Americas, Curitiba- Brazil

Abstract

Debugging is the most important task related to the test-ing activity. It has the goal of locating and removing a faultafter a failure occurred during test. However, it is not atrivial task and generally consumes effort and time. Debug-ging techniques generally use testing information but usu-ally they are very specific for certain domains, languagesand development paradigms. Because of this, a Neural Net-work (NN) approach has been investigated with this goal.It is independent of the context and presented promising re-sults for procedural code. However it was not validated inthe context of Object-Oriented (OO) applications. In addi-tion to this, the use of other Machine Learning techniquesis also interesting, because they can be more efficient. Withthis in mind, the present work adapts the NN approach tothe OO context and also explores the use of Support VectorMachines (SVMs). Results from the use of both techniquesare presented and analysed. They show that their use con-tributes for easing the fault localization task.

1. Introduction

Debugging is the task of locating and removingfaults [4]. It starts once a failure has been occurred. Becauseof this, debugging is seen as consequence of successful test-ing. Although debugging can and should be an orderly pro-cess, it is still very much an art [19]. Debugging frequentlydepends on the software engineer experience to search andvalidate hypothesis about the problem. This task is far fromtrivial because the manifestation of the error (failure) andits internal cause (fault) may have no obvious relationship.This fact makes the task of locating faults very expensiveand time consuming and, fundamental for debugging.

Failures can occur in diverse phases of the developmentprocess, such as after: codification, system integration test-ing and maintenance. Because of this, there are differenttechniques and tools with different goals. In this work,

our objective is to reduce effort spent in the fault localiza-tion task, after software testing. At this point, a completesoftware was implemented, and in addition to the code andspecification, test information is also available.

We find in the literature diverse techniques that use test-ing information for debugging [1, 2, 8, 9, 10, 11, 12, 15, 16,20, 26, 28, 30, 31, 32] and at least one commercial tool isavailable [23]. However, they present some limitations: 1)in most case, they use only trivial information about testing,for example, whether the test case reveals or not faults; 2)high costs in terms of debugging time; 3) some of them of-fer a lot of testing information about control and data flowsof the program, that may confuse the debugger; 4) in mostcases, they are not scalable for large programs and can notbe used in practice; and 5) most of them can be appliedonly in specific contexts, and are dependent on the language(usually procedural languages, such as C).

On the other hand, a very different approach was intro-duced by Wong et al [29]. They proposed an approach thatuses information about the testing coverage of the state-ments in the program. The testing results and the state-ments covered by the test cases are used to train a back-propagation neural network (NN). After the training, thenetworks show possible relationship between a statementin a program and a failure. Based on this result, suspi-cious statements can be ranked according to its risk of beingfaulty. This information and the slices execution of failedtests are combined to improve the efficacy of the method.They conducted experiments with seven C programs andobtained very promising results, comparing with other ap-proaches.

The approach proposed by Wong et al [29] does not de-pend on a particular model. The coverage data used can beobtained in different contexts. However, the approach wasvalidated by considering the testing criteria all-statementsonly in the context of procedural code and nowadays theObject Oriented (OO) paradigm is the most popular andused.

In this kind of paradigm, the class methods are smaller,

978-1-4244-4206-5/09/$25.00 ©2009 IEEE

and the challenge is on the class testing that includes theintegration testing of all methods in a class. A failure gen-erally occurs when the methods are integrated and could becaused by a combination of several faults. Because of thesedifferent aspects and due to the importance of OO applica-tions, this work explores the use of Machine Learning (ML)techniques in the OO testing fault localization context.

Using only coverage data, we adapt the approach ofWong et al, based on NN, for locating faults in OO appli-cations. The use of Support Vector Machines (SVMs) withthe same goal and methodology is also investigated. TheNN approach, and specially, the multilayer networks canrepresent general nonlinear functions, but are very hard totrain because of the abundance of local minima and the highdimensionality of the weight space. On the other hand, theSVMs use a more efficient training algorithm and can rep-resent complex, nonlinear functions too.

We present results from both techniques that are used toproduce a rank of possible faulty methods by using infor-mation about the methods of each class covered by the testcases. The results obtained are presented and analyzed.

The remaining sections are arranged as follows. Section2 introduces related work and describes in detail the NNapproach and methodology introduced by Wong et al andused in our work. Section 3 describes the used applicationsand how the ML techniques were applied in the context ofOO software. Section 4 presents and discusses the obtainedresults. Section 5 concludes the paper.

2. Related Work

In the literature, there are many works that explorethe use of design metrics to predict faulty methods andclasses [3, 5, 6, 7, 21, 22, 24]. Some of them use Ma-chine Learning Techniques, such as Neural Networks andBayesian networks [13, 14, 17, 18, 24, 33]. By using de-sign metrics these works try to estimate the probability of aclass being faulty with the goal of improving project beforeimplementation.

Our work has a different goal: to find the fault after im-plementation and testing, and ease debuging. The approachproposed by Wong et al [29] has a similar objective and usetest information, such as coverage data, for fault localiza-tion. This approach will be adopted in our study and will bedescribed in detail in this section.

The work of Wong et al considers as input a matrix Mof results obtained from the testing activity as illustrated inTable 1. Each row contains the statement coverage (1 meansthe statement is covered and 0 means not covered) and theresult of each test case (1 means a failure and 0 otherwise).For example, statement s6 is covered by test cases t1, t2, t3,t5, t6 and t7; t7 covers statements s1, s2, s3, s4, s6, s8 ands9 and reveals a fault when executed (column rt). In each

Table 1. Matrix of Test Coverage and Execu-tion Results (adapted from [29])

s1 s2 s3 s4 s5 s6 s7 s8 s9 rt

t1 1 1 1 1 0 1 0 0 1 0t2 1 0 0 0 1 1 1 1 0 0t3 0 0 0 0 0 1 1 0 0 0t4 1 1 0 0 1 0 1 1 1 0t5 1 1 1 0 1 1 1 1 1 0t6 0 0 1 0 0 1 1 1 0 1t7 1 1 1 1 0 1 0 1 1 1

row i, we obtained the coverage vector of test case i (cti).Besides M , a matrix representing virtual test cases in the

format of the matrix represented in Figure 1 is used. Thesevirtual test cases v1, v2, ..., vm are associated to the cov-erage vectors cv1, cv2, ..., cvm where each test case coversonly a statement si. They are named virtual, because we cannot find them in the real word. If the execution of vi fails,the probability that the fault is in si is extremely high andwe should first examine the statements whose correspond-ing virtual test case fails.

Figure 1. Virtual test cases format [29]

In a first step, a network is trained by using cti and rti

(for i = 1, 2, ...,m, obtained from Table 1) as the input dataand the corresponding expected output data, respectively.In the second step cvi (Figure 1) is used as input data tothe trained network to obtain the outputs r

′vi. At the end

the statements si are ranked in descending order based onr′vi, and examined one by one from the top until the fault is

located.Wong et al also use an approach to reduce the number

of suspicious statements by executing slices of failed tests.The slices contain the set of statements executed by a testcases. The reduction considers that the fault should be cov-ered by failed tests, or at least related to the statements cov-ered by failed tests. This implies that the most suspiciousstatements are the statements covered by all the failed exe-cutions. This is considered to rearranged the rank and elim-inate some statements.

3. Experiment

This section describes the experiment conducted for ap-plying the approach described in last section in the con-

text of OO applications by using NN and SVM techniques.Two different applications were used. Similar to the relatedwork, for each application a matrix M of results obtainedfrom the testing activity was used as the training set for bothMachine Learning techniques. In the OO context, each lineof the matrix corresponds to the coverage of a test case andeach column (attributes) is related to one method of a class.

The Machine Learning techniques were applied throughthe Weka Framework [25]. Weka is a collection of Ma-chine Learning algorithms for data mining tasks. Wekacontains tools for data pre-processing, classification, re-gression, clustering, association rules, and visualization. Itis also well-suited for developing new Machine Learningschemes. The applications and the methodology used todiscover the suspicious methods are explained next.

3.1. Applications

Two Java applications were used. The first one (A1) is anagenda system, this system organizes an agenda of events,schedules, etc. The second one is an image processingapplication, developed by students in a regular graduationcourse. A1 and A2 with, respectively, 44 and 32 methods.Some faults found during the development were chosen andre-implanted in each version. In this way, each faulty ver-sion has only one fault. A total of 8 faulty versions wereused for each application. The test cases were randomlygenerated. For A1, two sets of test cases were used with dif-ferent sizes (A1-a with 500 test cases and A1-b with 3000).For A2, only one set with 500 test cases was generated. Ta-ble 2 presents a summary of characteristics of each usedfaulty version. The columns show the number of test casesthat revealed (effective) or not the fault (ineffective) intro-duced in each version.

3.2. Machine Learning Algorithms

The chosen NN was a Multi-layer neural network withthe backpropagation training algorithm. The trainingdatasets were submitted to the Multi-layer neural algorithmusing the default parameters of Weka (Table 3). WekaFramework implements John Platt’s sequential minimal op-timization algorithm for training a support vector classifier.To obtain proper probability estimates, the user must set theoption that fits logistic regression models to the outputs ofthe support vector machine (SVM), the other parameters areset to default values (See Table 3).

3.3. Followed Steps

The following steps were taken, according to Wong etal’s approach (Section 2).

Table 2. Characteristics of the Used Applica-tions

Application no. of effective no. of ineffectivetest cases test cases

A1-a Version 1a 407 93A1-a Version 2a 440 60A1-a Version 3a 385 115A1-a Version 4a 342 158A1-a Version 5a 451 49A1-a Version 6a 483 17A1-a Version 7a 399 101A1-a Version 8a 288 212A1-b Version 1b 2720 280A1-b Version 2b 2796 204A1-b Version 3b 2624 376A1-b Version 4b 2490 510A1-b Version 5b 2726 274A1-b Version 6b 2978 22A1-b Version 7b 2669 331A1-b Version 8b 2337 663

A2 Version 1 8 492A2 Version 2 36 464A2 Version 3 105 395A2 Version 4 16 484A2 Version 5 38 462A2 Version 6 40 460A2 Version 7 100 400A2 Version 8 100 400

1. Instrumentation of each application (and correspon-dent versions) to produce the trace of each execution.

2. Implementation of a script to automatically captureand compare the outputs of the application and cor-respondent faulty versions.

3. Automatic execution of the applications and faulty ver-sions: the results were stored in files and a matrix suchas that presented in Table 1 was obtained. In such ma-trix the columns represent the covered methods.

4. Application of the Machine Learning algorithms byusing the obtained matrices.

5. Creation of the virtual test sets: where each test casescovers only one method.

6. Estimation of the most suspicious methods by usingthe matrix of virtual test cases as input to the trainednetworks and results of Step 4.

The program using the MultiLayerPerceptron (MP) wasexecuted ten times, using the options seed parameters (0to 9), in this way we have different initial weights for theneural network. The suspicious methods were classifiedin decreasing order based in the probabilities returned bythe algorithms. The results for each application and test set(A1-a, A1-b e A2) are presented respectively in Tables 4, 5

and 6. The entries of these tables represent the positionof the faulty method indicated for each faulty version andeach used seed. For example, Table 4 shows that the faultymethod of faulty version 1 of A1-a was ordered in the firstposition, except when the MP-seed 2 was used. In such casethe method was ordered in fourth position. In the last rowof these tables, 1 means that the faulty version is on the topof SVM list; ’-’ means that SVM gives no answer.

3.4. Analysis of the Results

We can note from Tables 4, 5 and 6 that, for some ver-sions, the results of applying the Machine Learning tech-niques is very good. The faulty methods are in the top ofthe rank. However, for other versions the results seem notso good, the faulty methods are ranked in last positions.

When we consider NN technique, better results are ob-tained with A2, and we observe a non-influence of the testcase set size. It seems that the results are influenced by thenumber of test cases that revealed the corresponding fault.

The SVM results are clear, when the algorithm convergesto one solution, the faulty method is in the top, otherwisethe SVM do not converge and no answer is given. Thishappens in most cases for application A1. The observedfactors should be better investigated in future experimentswith other kind of applications.

Figure 2. A1 Results, 500 test cases


As mentioned before, most existent approaches and toolsare not directly applied to OO code. To better evaluate if the


neural network is an advantageous method, a comparativerandom fault localization approach was created. For eachversion, a ranking was randomly produced. This rankingwas compared with the neural network ranking. BecauseSVM did not converge for many versions it was not consid-ered in this comparison.

The Wilcoxon rank sum test [27] with continuity cor-rection was applied. For all cases, the Wilcoxon test con-firms that all the distributions are different and that the NNapproach is better than the random one. These results aresummarized on Figures 2, 3 and 4.

The figures show the box-and-whisker plots of the givengrouped values. It is possible to note that the ranking posi-tion for the faulty method obtained by the NN algorithm isalways lower than the obtained with the random approach;the mean values are better for the NN approach too. Thisshows that the NN approach is the best strategy for faultlocalization in our applications, compared with the randomone.

4. Conclusions

This paper explored two ML techniques, NN and SVM,for fault localization in OO applications, according to themethodology proposed by Wong et al [29].

Both techniques were used to produce a rank of possiblefaulty methods. To apply the techniques we used informa-tion about the testing activity: the methods covered (exer-cised) by a test case and the test result (that is, whether thetest case produced or not a failure). To produce the rank, aset of virtual test cases was provided. In this set each testcase ideally covers (exercises) only a method.

The results show that many times the faulty method is onthe top of the NN rank. In some cases the NN results are notgood. On the other hand, the faulty version is always in thetop of SVM rank, when this rank can be obtained, becauseSVM did not converge for many times.

Comparing NN approach with a random one, the faultversions are always better classified in the NN rank than inthe random rank. This was confirmed by the Wilcoxon test.

Table 3. Weka Parameters

Multi-layer neural algorithmLearning Rate 0.3Momentum 0.2Training Time 500Validation Set Size 0Random Seed 0Validation Threshold 20checksTurnedOff False

Support vector machine (SVM)buildLogisticModels Truethe complexity parameter C 1.0debug Falseepsilon 1.0 E-12filterType set to normalize training datakernel Polykernel -c 250007 -E10numFolds -1randomSeed 1toleranceParameter 0.0010

Table 4. A1-a Results with the set of 500 test casesVersion 1 Version 2 Version 3 Version 4 Version 5 Version 6 Version 7 Version 8

MP Seed: 0 1 1 36 43 26 25 34 1MP Seed: 1 1 1 23 40 6 36 6 1MP Seed: 2 4 1 8 16 9 37 40 4MP Seed: 3 1 1 7 43 8 30 37 3MP Seed: 4 1 1 9 44 5 37 32 4MP Seed: 5 1 1 5 40 3 27 33 4MP Seed: 6 1 1 8 8 4 25 38 1MP Seed: 7 1 1 18 39 9 28 42 1MP Seed: 8 1 1 17 41 2 15 40 2MP Seed: 9 1 1 37 44 2 39 1 1

SVM 1 - - - - - - -

Table 5. A1-b Results with the set of 3000 test casesVersion 1 Version 2 Version 3 Version 4 Version 5 Version 6 Version 7 Version 8


SVM 1 - - - - - - -

Table 6. A2 Results with the set of 500 test casesVersion 1 Version 2 Version 3 Version 4 Version 5 Version 6 Version 7 Version 8


SVM - 1 - 1 1 1 - -

This shows that the NN based strategy is the best, amongthe studied.

An advantage of NN and SVM approach is that they areindependent of the implementation language and develop-ment paradigm. It can be used as complementary (before)to an ad-hoc strategy, contributing to reduce efforts of thetesting activity by reducing the number of analysed meth-ods in the fault localization task. However, other studiesshould be conducted with other applications to better evalu-ate the SVM approach, as well as other ML techniques and,the characteristics of the applications that can influence inthe ML performance.

References

[1] H. Agrawal, J. Alberi, J. R. Horgan, J. Li, S. London, W. E.Wong, S. Ghosh, and N. Wilde. Mining system tests to aidsoftware maintenance. IEEE Computer, 31(7):64–73, 1998.

[2] H. Agrawal, J. R. Horgan, S. London, and W. E. Wong. Faullocalization using execution slices and dataflow tests. 6th In-ternational Symposium on Software Reliability Engineering,pages 143–151, 1995.

[3] M. Alshayeb and W. Li. An empirical validation of object-oriented metrics in two different iterative software processes.IEEE Trans. Softw. Eng., 29(11):1043–1049, 2003.

[4] K. Araki, Z. Furukawa, and J. Cheng. A general frameworkfor debugging. IEEE Software, 8(3):14–20, 1991.

[5] V. R. Basili, L. C. Briand, and W. L. Melo. A validation ofobject-oriented design metrics as quality indicators. IEEETrans. Softw. Eng., 22(10):751–761, 1996.

[6] L. C. Briand, J. Wust, J. Daly, and V. Porter. A compre-hensive empirical validation of design measures for object-oriented systems. In METRICS ’98: Proceedings of the5th International Symposium on Software Metrics, page 246,Washington, DC, USA, 1998. IEEE Computer Society.

[7] L. C. Briand, J. Wust, J. W. Daly, and D. V. Porter. Explor-ing the relationships between design measures and softwarequality in object-oriented systems. The Journal of Systemsand Software, 51(3):245–273, 2000.

[8] T. W. Chan. A framework for debugging. Journal of Com-puter Information Systems, 38(1):67–73, 1997.

[9] T. Y. Chen and Y. Y. Cheung. On program dicing. Journalof Software Maintenance, 9(1):33–46, 1997.

[10] H. Cleve and A. Zeller. Locating causes of program failures.27th Int. Conference on Software Engineering, pages 342–351, 2005.

[11] J. S. Collofello and L. Cousins. Toward automatic softwarefault localization through decision-to-decisions path analy-sis. AFIP 1987 National Computer Conference, pages 539–544, 1987.

[12] R. A. DeMillo, H. Pan, and E. H. Sfford. Critical slicing forsoftware fault localization. ACM SIGSOFT Int. Symposiumon Software Testing and Analysis, pages 121–134, 1996.

[13] n. Elena Perez-Mi and J.-J. Gras. Improving fault predictionusing bayesian networks for the development of embeddedsoftware applications: Research articles. Softw. Test. Verif.Reliab., 16(3):157–174, 2006.

[14] N. Fenton, M. Neil, W. Marsh, P. Hearty, D. Marquez,P. Krause, and R. Mishra. Predicting software defects in

varying development lifecycles using bayesian nets. Inf.Softw. Technol., 49(1):32–43, 2007.

[15] P. Fritzson, N. Shahmehri, M. Kamkar, and T. Gyimothy.Generalized algorithmic debugging and testing. ACM Let-ters on Programming Languages and Systems, 1(4):303–322, 1992.

[16] B. Korel and J. Laski. Dinamic program slicing. InformationProcesing Letters, 29(3):155–163, 1988.

[17] H. Lounis and L. Ait-Mehedine. Machine-learning tech-niques for software product quality assessment. In QSIC’04: Proceedings of the Quality Software, Fourth Interna-tional Conference, pages 102–109, Washington, DC, USA,2004. IEEE Computer Society.

[18] G. J. Pai and J. B. Dugan. Empirical analysis of softwarefault content and fault proneness using bayesian methods.IEEE Transaction on Software Engineering, 33(10):675–686, October 2007.

[19] R. Pressman. Software engineering: A practitioner approach.2006.

[20] J. R.Lyle and M. Weiser. Automatic program bug locationby program slicing. II Int. Conference on Computers andApplications, pages 877–883, 1987.

[21] R. Subramanyam and M. S. Krishnan. Empirical analysisof CK metrics for object-oriented design complexity: Im-plications for software defects. IEEE Trans. Softw. Eng.,29(4):297–310, 2003.

[22] G. Succi, W. Pedrycz, M. Stefanovic, and J. Miller. Practicalassessment of the models for identification of defect-proneclasses in object-oriented commercial systems using designmetrics. J. Syst. Softw., 65(1):1–12, 2003.

[23] T. Technologies. xSuds Toolsuite. 1998.http://xsuds.argreenhouse.com.

[24] M. M. T. Thwin and T.-S. Quah. Application of neural net-works for software quality prediction using object-orientedmetrics. J. Syst. Softw., 76(2):147–156, 2005.

[25] U. Waikato. Weka - machine lerning softwarein Java. University of Waikato. Available onhttp://www.cs.waikato.ac.nz/ml/weka, 2007.

[26] M. Weiser. Program slicing. IEEE Transactions on SoftwareEngineering., 10(4):352–357, 1984.

[27] F. Wilcoxon. Individual comparisons by ranking methods.Biometrics, 1:80–83, 1995.

[28] W. E. Wong, J. R. Jorgan, S. S. Gokhale, and K. S. Trivedi.Locating program features using execution slices. IEEESymposium on Applications-Specific Systems and SoftwareEngineering and Technology, pages 194–203, 1999.

[29] W. E. Wong, L. Zhao, Y. Qi, K. Cai, and J. Dong. Effec-tive fault localization using bp neural networks. In SoftwareEngineering and Knowledge Engineering Conference, pages374–379, 2007.

[30] A. Zeller. Isolating cause-effect chains from computerprograms. ACM SIGSOFT Software Engineering Notes,27(6):1–10, 2002.

[31] X. Zhang and R. Gupta. Cost effective dynamic programslicing. ACM SIGPLAN 2004 Conference on ProgrammingLanguage Design and Implementation, pages 94–106, 2004.

[32] X. Zhang, R. Gupta, and Y. Zhang. Precise dynamic slic-ing algorithms. XXV International Conference on SoftwareEngineering - ICSE’03, pages 319–329, 2003.

[33] Y. Zhou and H. Leung. Empirical analysis of object-orienteddesign metrics for predicting high and low severity faults.IEEE Transaction on Software Engineering, 32(10):771–789, October 2006.

[IEEE 2009 10th Latin American Test Workshop - Rio de Janeiro, Brazil (2009.03.2-2009.03.5)] 2009...

Documents

Transcript of [IEEE 2009 10th Latin American Test Workshop - Rio de Janeiro, Brazil (2009.03.2-2009.03.5)] 2009...