A Comparative Evaluation of Static Analysis Actionable Alert Identification Techniques
Sarah Heckman and Laurie Williams
Department of Computer Science, North Carolina State University
PROMISE 2013


Motivation
• Automated static analysis can find a large number of alerts
  – Empirically observed alert density of 40 alerts/KLOC [HW08]
• Alert inspection is required to determine if the developer should (and could) fix an alert
  – Developers may fix only 9% [HW08] to 65% [KAY04] of alerts
  – Suppose 1000 alerts at 5 minutes of inspection per alert: 10.4 work days to inspect all alerts
  – Potential savings of 3.6-9.5 work days by only inspecting the alerts the developer will fix (a quick arithmetic check follows this list)
• Fixing 3-4 alerts that could lead to field failures justifies the cost of static analysis [WDA08]
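A back-of-the-envelope check of the inspection figures above, as a minimal Python sketch; it assumes 8-hour work days and the 5-minute-per-alert inspection time stated on the slide.

```python
# Back-of-the-envelope check of the inspection-cost figures above,
# assuming 8-hour work days and 5 minutes of inspection per alert.
ALERTS = 1000
MINUTES_PER_INSPECTION = 5
MINUTES_PER_WORK_DAY = 8 * 60

def inspection_days(num_alerts):
    return num_alerts * MINUTES_PER_INSPECTION / MINUTES_PER_WORK_DAY

all_alerts = inspection_days(ALERTS)                   # ~10.4 work days
fix_9_percent = inspection_days(int(ALERTS * 0.09))    # 9% actionable [HW08]
fix_65_percent = inspection_days(int(ALERTS * 0.65))   # 65% actionable [KAY04]

print(f"Inspect everything:   {all_alerts:.1f} days")
print(f"Savings at 9% fixed:  {all_alerts - fix_9_percent:.1f} days")   # ~9.5
print(f"Savings at 65% fixed: {all_alerts - fix_65_percent:.1f} days")  # ~3.6
```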


Coding Problem?
• Actionable: alerts the developer wants to fix
  – Faults in the code
  – Conformance to coding standards
  – Developer action: fix the alert in the source code
• Unactionable: alerts the developer does not want to fix
  – Static analysis false positive
  – Developer knowledge that the alert is not a problem
  – Inconsequential coding problems (style)
  – Fixing the alert may not be worth the effort
  – Developer action: suppress the alert


Actionable Alert Identification Techniques (AAIT)
• Supplement automated static analysis
  – Classification: predict actionability
  – Prioritization: order by predicted actionability (the contrast is sketched after this list)
• AAIT utilize additional information about the alert, code, and other artifacts
  – Artifact Characteristics (ACs)

• Can we determine a “best” AAIT?
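To make the classification/prioritization distinction concrete, here is a minimal sketch; the Alert fields and the single actionability score are illustrative assumptions, not part of any specific AAIT.

```python
from dataclasses import dataclass

# Hypothetical alert with a single predicted-actionability score; real AAIT
# derive such a prediction from artifact characteristics in technique-specific ways.
@dataclass
class Alert:
    alert_type: str
    file_path: str
    line: int
    score: float  # predicted probability that the alert is actionable

def classify(alerts, threshold=0.5):
    """Classification AAIT: label each alert actionable / unactionable."""
    return [(a, a.score >= threshold) for a in alerts]

def prioritize(alerts):
    """Prioritization AAIT: order alerts by predicted actionability."""
    return sorted(alerts, key=lambda a: a.score, reverse=True)
```

A classifier answers "should this alert be inspected at all?", while a prioritizer answers "which alert should be inspected first?".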


Research Objective
• To inform the selection of an actionable alert identification technique for ranking the output of automated static analysis through a comparative evaluation of six actionable alert identification techniques.


Related Work
• Comparative evaluation of AAIT [AAH12]
  – Languages: Java and Smalltalk
  – ASA: PMD, FindBugs, SmallLint
  – Benchmark: FAULTBENCH
  – Evaluation Metrics (sketched after this list)
    • Effort – "average number of alerts one must inspect to find an actionable one"
    • Fault Detection Rate Curve – number of faults detected against number of alerts inspected
  – Selected AAIT: APM, FeedbackRank, LRM, ZRanking, ATL-D, EFindBugs
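A hedged sketch of the two evaluation metrics, under one reading of the definitions above: the input is a list of booleans, in an AAIT's ranked order, saying whether each alert turned out to be actionable.

```python
# One reading of the two metrics from [AAH12]: walk alerts in ranked order,
# with a boolean per alert indicating whether it was actionable.
def effort(ranked_is_actionable):
    """Average rank (alerts inspected) per actionable alert found."""
    positions, inspected = [], 0
    for is_actionable in ranked_is_actionable:
        inspected += 1
        if is_actionable:
            positions.append(inspected)
    return sum(positions) / len(positions) if positions else float("inf")

def fault_detection_curve(ranked_is_actionable):
    """Cumulative count of actionable alerts found vs. alerts inspected."""
    curve, found = [], 0
    for is_actionable in ranked_is_actionable:
        found += int(is_actionable)
        curve.append(found)
    return curve
```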


Comparative Evaluation
• Considered AAIT in literature [HW11][SFZ11]
• Selection Criteria
  – AAIT classify or prioritize alerts generated by automated static analysis for the Java programming language
  – An implementation of the AAIT is described, allowing for replication
  – The AAIT is fully automated and does not require manual intervention or inspection of alerts as part of the process


Selected AAIT (1)
• Actionable Prioritization Models (APM) [HW08]
  – ACs: code location, alert type
• Alert Type Lifetime (ATL) [KE07a]
  – AC: alert type lifetime
  – ATL-D: measures the lifetime in days
  – ATL-R: measures the lifetime in revisions (a lifetime computation is sketched after this list)
• Check 'n' Crash (CnC) [CSX08]
  – AC: test failures
  – Generates tests that try to cause RuntimeExceptions
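A minimal sketch of an ATL-R-style computation, assuming each alert records the revision at which it appeared and the revision at which it was closed; the field names and the aggregation are illustrative assumptions, not the authors' exact formulation.

```python
from collections import defaultdict
from statistics import mean

# ATL-R-style sketch: average lifetime per alert type, measured in revisions.
# The opened_rev/closed_rev fields are assumptions; still-open alerts would
# need the last observed revision as their closing point.
def alert_type_lifetimes(alerts):
    by_type = defaultdict(list)
    for a in alerts:
        by_type[a.alert_type].append(a.closed_rev - a.opened_rev)
    # Shorter average lifetimes suggest developers address that alert type
    # quickly, so those types are ranked as more likely to be actionable.
    return {t: mean(lifetimes) for t, lifetimes in by_type.items()}
```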


Selected AAIT (2)
• History-Based Warning Prioritization (HWP) [KE07b]
  – ACs: commit messages that identify fault/non-fault fixes
• Logistic Regression Models (LRM) [RPM08]
  – ACs: 33, including two proprietary/internal ACs
• Systematic Actionable Alert Identification (SAAI) [HW09]
  – ACs: 42
  – Machine learning (an illustrative sketch follows this list)
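In the spirit of SAAI's machine-learning step, a minimal scikit-learn sketch; the classifier choice and feature encoding are assumptions for illustration, not the authors' actual setup over the 42 artifact characteristics.

```python
# Minimal machine-learning AAIT sketch in the spirit of SAAI; the classifier
# and feature encoding are illustrative assumptions.
from sklearn.ensemble import RandomForestClassifier

def train_aait(train_features, train_labels):
    """train_features: rows of numeric artifact-characteristic values,
    train_labels: True for actionable, False for unactionable."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(train_features, train_labels)
    return model

def classify_alerts(model, test_features):
    """Classification: predict a label per alert."""
    return model.predict(test_features)

def rank_alerts(model, test_features):
    """Prioritization: rank alert indices by predicted probability of being actionable."""
    scores = model.predict_proba(test_features)[:, 1]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```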


FAULTBENCH v0.3
• 3 Subject Programs: jdom, runtime, logging
• Procedure
  1. Gather Alert and Artifact Characteristic Data Sources
  2. Artifact Characteristic and Alert Oracle Generation
  3. Training and Test Sets
  4. Model Building
  5. Model Evaluation


Gather Data
• Download from repo
• Compile
• ASA – FindBugs & Check 'n' Crash (ESC/Java)
• Source Metrics – JavaNCSS
• Repository History – CVS & SVN
• Difficulties
  – Libraries changed over time
  – Not every revision would build (especially early ones)


Artifact Characteristics

Independent Variables
• Alert Identifier and History
  – Alert information (type, location)
  – Number of alert modifications
• Source Code Metrics
  – Size and complexity metrics
• Source Code History
  – Developers
  – File creation, deletion, and modification revisions
• Source Code Churn
  – Added and deleted lines of code
• Aggregate Characteristics
  – Alert lifetime, alert counts, staleness

Dependent Variable – Alert Classification (a per-alert record is sketched after the figure below)


[Figure: an alert, described by its alert info and surrounding code, is classified as either an actionable alert or an unactionable alert]
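The independent and dependent variables above could be collected into a per-alert record along these lines; the fields are an illustrative subset (the study's AAIT use up to 42 artifact characteristics), not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative subset of the artifact characteristics listed above; field
# names are assumptions for this sketch.
@dataclass
class AlertRecord:
    # Alert identifier and history
    alert_type: str
    file_path: str
    line: int
    num_modifications: int
    # Source code metrics
    size_loc: int
    cyclomatic_complexity: int
    # Source code history and churn
    num_developers: int
    added_lines: int
    deleted_lines: int
    # Aggregate characteristics
    lifetime_revisions: int
    open_alert_count: int
    staleness: int
    # Dependent variable, filled in by the alert oracle
    actionable: Optional[bool] = None
```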


Alert Oracle Generation


• Iterate through all revisions, starting with the earliest, and compare alerts between revisions
• Closed → Actionable
• Filtered → Unactionable
• Deleted
• Open
  – Inspection
  – All unactionable

[Figure: alert state diagram with states Open, Deleted, Closed, and Filtered]
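A simplified sketch of the revision-walking oracle described above; `run_asa` and the alert-identity matching are assumptions, and the filtered/deleted cases are collapsed for brevity.

```python
# Simplified oracle sketch: alerts that disappear after a code change are
# treated as closed (actionable); alerts still open at the end are treated
# as unactionable unless inspected. Filtered/deleted handling is omitted.
def build_alert_oracle(revisions, run_asa):
    """revisions: ordered revision ids (earliest first);
    run_asa(rev): set of alert identities reported at that revision."""
    oracle = {}
    previous = set()
    for rev in revisions:
        current = run_asa(rev)
        for alert in previous - current:
            oracle[alert] = "actionable"      # closed by a code change
        for alert in current - previous:
            oracle.setdefault(alert, "open")  # newly reported alert
        previous = current
    for alert, state in oracle.items():
        if state == "open":
            oracle[alert] = "unactionable"    # uninspected open alerts
    return oracle
```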


Training and Test Sets
• Simulate how AAIT would be used in practice
• Training set: first X% of revisions to train the models (a split sketch follows this list)
  – 70%, 80%, and 90%
• Test set: use the remaining 100-X% of revisions to test the models
• Overlapping alerts
  – Alerts open at the cutoff revision
• Deleted alerts
  – If an alert is deleted, it is not considered, unless the deletion occurs after the training set; in that case the alert is still used in model building
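A minimal split-by-revision sketch; revision numbers are assumed to be sequential indices, and the overlapping/deleted-alert rules above are simplified.

```python
# Minimal split-by-revision sketch; the overlapping- and deleted-alert
# rules described above are simplified here.
def split_by_revision(alerts, num_revisions, cutoff=0.7):
    """Training set: alerts opened within the first `cutoff` fraction of
    revisions (70/80/90% in the study); test set: the rest."""
    cutoff_rev = int(num_revisions * cutoff)
    train = [a for a in alerts if a.opened_rev <= cutoff_rev]
    test = [a for a in alerts if a.opened_rev > cutoff_rev]
    return train, test
```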


Model Building & Model Evaluation


• Classification Statistics:
  – Precision = TP / (TP + FP)
  – Recall = TP / (TP + FN)
  – Accuracy = (TP + TN) / (TP + TN + FP + FN)

                        Predicted      Actual
  True Positive (TP)    Actionable     Actionable
  False Positive (FP)   Actionable     Unactionable
  False Negative (FN)   Unactionable   Actionable
  True Negative (TN)    Unactionable   Unactionable

• All AAIT are built using the training data and evaluated by predicting the actionability of the test data
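The evaluation statistics above reduce to counts over the confusion matrix; a small sketch follows (zero denominators are guarded, which the slide's formulas leave implicit).

```python
# Classification statistics from parallel predicted/actual labels
# (True = actionable, False = unactionable).
def classification_stats(predicted, actual):
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(actual) if actual else 0.0
    return precision, recall, accuracy
```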


Results - jdom


          Accuracy (%)        Precision (%)       Recall (%)
AAIT      70    80    90      70    80    90      70    80    90
APM       80    83    87      46    42     0       9    10     0
ATL-D     72    83    88      26    20    20      22     2     3
ATL-R     77    81    86      32    24    24      11     8    13
CnC       73    80    95     100   100     0       6     9     0
HWP       31    35    32      19    15     9      73    67    57
LRM       72    76    83      37    35    32      64    55    59
SAAI      83    86    90      92   100    67      16    13     7
(70/80/90 = training set size as a percentage of revisions)


Results - runtime


          Accuracy (%)        Precision (%)       Recall (%)
AAIT      70    80    90      70    80    90      70    80    90
APM       36    23    50      88    70    47      32    17    57
ATL-D     18    17    55      92    82   100       8     4     3
ATL-R     34    43    59      93    94    55      27    36    60
HWP       68    66    46      88    85    45      74    73    83
LRM       88    87    53      88    87    49     100   100   100
SAAI      49    65    83      90    91   100      48    66    63


Results - logging

          Accuracy (%)        Precision (%)       Recall (%)
AAIT      70    80    90      70    80    90      70    80    90
APM       85    89    92       0     0     0       0     0     0
ATL-D     92    97   100       0     0     0       0     0     0
ATL-R     92    97   100       0     0     0       0     0     0
CnC       67   100   100       0     0     0       0     0     0
HWP       32    35    33       8     4     0     100   100     0
LRM       77    84    83      25    14     0     100   100     0
SAAI      90    97   100       0     0     0       0     0     0


Threats to Validity
• Internal Validity
  – Automation of data generation, collection, and artifact characteristic generation
  – Alert oracle – uninspected alerts are considered unactionable
  – Alert closure is not an explicit action by the developer
  – Alert continuity is not perfect
    • An alert is closed and a new alert opened if both the line number and source hash of the alert change
  – Number of revisions
• External Validity
  – Generalizability of results
  – Limitations of the AAIT in the comparative evaluation
• Construct Validity
  – Calculations for artifact characteristics


Future Work
• Incorporate additional projects into FAULTBENCH
  – Emphasis on adding projects that actively use ASA and include filter files
  – Allow for evaluation of AAIT with different goals
• Identification of the most predictive artifact characteristics
• Evaluate different windows for generating test data
  – A full project history may not be as predictive as the most recent history


Conclusions
• SAAI found to be the best overall model when considering accuracy
  – Highest accuracy, or tie, for 6 of 9 treatments
• ATL-D, ATL-R, and LRM were also predictive when considering accuracy
  – CnC also performed well, but only considered alerts from one ASA
• LRM and HWP had the highest recall


References

[AAH12] S. Allier, N. Anquetil, A. Hora, and S. Ducasse, "A Framework to Compare Alert Ranking Algorithms," Proceedings of the 19th Working Conference on Reverse Engineering, Kingston, Ontario, Canada, October 15-18, 2012, pp. 277-285.

[CSX08] C. Csallner, Y. Smaragdakis, and T. Xie, "DSD-Crasher: A Hybrid Analysis Tool for Bug Finding," ACM Transactions on Software Engineering and Methodology, vol. 17, no. 2, pp. 1-36, April 2008.

[HW08] S. Heckman and L. Williams, "On Establishing a Benchmark for Evaluating Static Analysis Alert Prioritization and Classification Techniques," Proceedings of the 2nd International Symposium on Empirical Software Engineering and Measurement, Kaiserslautern, Germany, October 9-10, 2008, pp. 41-50.

[HW09] S. Heckman and L. Williams, "A Model Building Process for Identifying Actionable Static Analysis Alerts," Proceedings of the 2nd IEEE International Conference on Software Testing, Verification and Validation, Denver, CO, USA, 2009, pp. 161-170.

[HW11] S. Heckman and L. Williams, "A Systematic Literature Review of Actionable Alert Identification Techniques for Automated Static Code Analysis," Information and Software Technology, vol. 53, no. 4, April 2011, pp. 363-387.

[KE07a] S. Kim and M. D. Ernst, "Prioritizing Warning Categories by Analyzing Software History," Proceedings of the International Workshop on Mining Software Repositories, Minneapolis, MN, USA, May 19-20, 2007, p. 27.

[KE07b] S. Kim and M. D. Ernst, "Which Warnings Should I Fix First?," Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Dubrovnik, Croatia, September 3-7, 2007, pp. 45-54.

[KAY04] T. Kremenek, K. Ashcraft, J. Yang, and D. Engler, "Correlation Exploitation in Error Ranking," Proceedings of the 12th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Newport Beach, CA, USA, 2004, pp. 83-93.

[RPM08] J. R. Ruthruff, J. Penix, J. D. Morgenthaler, S. Elbaum, and G. Rothermel, "Predicting Accurate and Actionable Static Analysis Warnings: An Experimental Approach," Proceedings of the 30th International Conference on Software Engineering, Leipzig, Germany, May 10-18, 2008, pp. 341-350.

[SFZ11] H. Shen, J. Fang, and J. Zhao, "EFindBugs: Effective Error Ranking for FindBugs," Proceedings of the 4th IEEE International Conference on Software Testing, Verification and Validation, Berlin, Germany, March 21-25, 2011, pp. 299-308.

[WDA08] S. Wagner, F. Deissenboeck, M. Aichner, J. Wimmer, and M. Schwalb, "An Evaluation of Two Bug Pattern Tools for Java," Proceedings of the 1st International Conference on Software Testing, Verification, and Validation, …
