Scalable Statistical Bug Isolation
Ben Liblit, Mayur Naik, Alice Zheng, Alex Aiken, and Michael Jordan, 2005
University of Wisconsin, Stanford University, and UC Berkeley
Mustafa Dajani, 27 Nov 2006, CMSC 838P
Overview of the Paper
• explains a statistical debugging algorithm that can isolate bugs in programs containing multiple undiagnosed bugs
• presents a practical, scalable algorithm for isolating multiple bugs in many software systems
Outline:
• Introduction
• Background
• Cause Isolation Algorithm
• Experiments
Objective of the Study: to develop a statistical algorithm that hunts for the causes of failures
• Crash reporting systems are useful for collecting data
• Actual executions are a vast resource
• Feedback data from those executions can be used to find the causes of failures
Introduction
• Statistical debugging - a dynamic analysis for detecting the causes of run failures.
- an instrumented program monitors its own behavior by sampling information
- this involves testing predicates at particular events during the run
• Predicates, P - candidate bug predictors; large programs may contain thousands of predicates
• Feedback Report, R - records whether a run succeeded or failed.
Introduction
• The study’s model of behavior:
“If P is observed to be true at least once during run R then R(P) = 1, otherwise R(P) = 0.”
- In other words, random sampling counts how often “P observed” and “P observed true” occur during a run
• Previous work used regularized logistic regression, which tries to select predicates that determine the outcome of every run
- but that approach selects redundant predicates and has difficulty predicting multiple bugs
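A minimal sketch of the R(P) model above, assuming a feedback report stores the run outcome plus, for each sampled predicate, how often it was observed and how often it was observed true. The `FeedbackReport` class and its field names are hypothetical and are not taken from the paper or its tooling.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackReport:
    """One run's report (hypothetical layout, for illustration only)."""
    failed: bool
    # predicate name -> (times observed, times observed true)
    counts: dict = field(default_factory=dict)

    def observed(self, p):
        """True if predicate p was sampled at least once in this run."""
        return self.counts.get(p, (0, 0))[0] > 0

    def R(self, p):
        """R(P) = 1 if P was observed true at least once, otherwise 0."""
        return 1 if self.counts.get(p, (0, 0))[1] > 0 else 0


# Example run: "x > 0" was sampled 3 times and true once;
# "p == NULL" was sampled twice and never true.
report = FeedbackReport(failed=True,
                        counts={"x > 0": (3, 1), "p == NULL": (2, 0)})
print(report.R("x > 0"), report.R("p == NULL"), report.R("never sampled"))
```

Note that a predicate that was never sampled and one that was sampled but never true both give R(P) = 0; the separate "observed" count is what distinguishes them.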
Introduction
• Study design:
- determine all possible predicates
- eliminate predicates that have no predictive power
- loop {
    - rank the surviving predicates by importance
    - remove the top-ranked predicate P
    - discard all runs R where R(P) = 1
    - go to top of loop until the set of runs or the set of predicates is empty
  }
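The elimination loop can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: each run is a dict with a `failed` flag and a `true` set of predicates observed true, the `importance` function is passed in by the caller, and the early exit when no predicate scores above zero is an added assumption that stands in for the "no predictive power" filter. The stand-in score `failing_true_count` is hypothetical and much cruder than the ranking discussed later in the slides.

```python
def isolate(runs, predicates, importance):
    """Repeatedly pick the top-ranked predicate and discard the runs it covers."""
    runs = list(runs)
    predicates = set(predicates)
    chosen = []
    while runs and predicates:
        # Rank surviving predicates; sorted() makes tie-breaking deterministic.
        best = max(sorted(predicates), key=lambda p: importance(runs, p))
        if importance(runs, best) <= 0:
            break  # assumption: stop once nothing predicts a remaining failure
        chosen.append(best)
        predicates.remove(best)
        # Discard every run R where R(best) = 1.
        runs = [r for r in runs if best not in r["true"]]
    return chosen


def failing_true_count(runs, p):
    """Stand-in importance: number of failing runs where p was observed true."""
    return sum(1 for r in runs if r["failed"] and p in r["true"])


runs = [
    {"failed": True, "true": {"A", "A_sub"}},
    {"failed": True, "true": {"A"}},
    {"failed": True, "true": {"B"}},
    {"failed": False, "true": {"C"}},
]
result = isolate(runs, {"A", "A_sub", "B", "C"}, failing_true_count)
print(result)
```

Because the runs covered by a chosen predicate are discarded before the next ranking, a redundant predicate such as `A_sub` loses its supporting failures and is not selected again.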
Bug Isolation Architecture
[Figure: architecture diagram with components Program Source, Compiler, Sampler, Predicates, Shipping Application, Counts & success/failure outcome, Statistical Debugging, and Top bugs with likely causes]
Depicting failures through P
Failure(P) = F(P) / (F(P) + S(P))
F(P) = # of failures where P observed true
S(P) = # of successes where P observed true
Predicting P’s truth or falsehood
Context(P) = F(P observed) / (F(P observed) + S(P observed))
F(P observed) = # of failures observing P
S(P observed) = # of successes observing P
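As a worked example of the two ratios, the sketch below computes Failure(P) and Context(P) from a list of runs. It also computes Increase(P) = Failure(P) - Context(P), the quantity the ranking slides refer to, as defined in the paper by Liblit et al. The run layout (dicts with `failed`, `observed`, and `true` keys) and the function name are assumptions made for illustration.

```python
def failure_context_increase(runs, p):
    """Return (Failure(P), Context(P), Increase(P)), or None if undefined."""
    f_true = sum(1 for r in runs if r["failed"] and p in r["true"])
    s_true = sum(1 for r in runs if not r["failed"] and p in r["true"])
    f_obs = sum(1 for r in runs if r["failed"] and p in r["observed"])
    s_obs = sum(1 for r in runs if not r["failed"] and p in r["observed"])
    if f_true + s_true == 0 or f_obs + s_obs == 0:
        return None  # P never observed true (or never observed at all)
    failure = f_true / (f_true + s_true)
    context = f_obs / (f_obs + s_obs)
    return failure, context, failure - context


runs = [
    {"failed": True,  "observed": {"P"}, "true": {"P"}},
    {"failed": True,  "observed": {"P"}, "true": {"P"}},
    {"failed": False, "observed": {"P"}, "true": set()},
    {"failed": False, "observed": {"P"}, "true": set()},
    {"failed": False, "observed": {"P"}, "true": {"P"}},
    {"failed": True,  "observed": set(), "true": set()},  # P never reached
]
failure, context, increase = failure_context_increase(runs, "P")
print(failure, context, increase)
```

Here F(P) = 2 and S(P) = 1, so Failure(P) = 2/3; P was observed in 2 failing and 3 successful runs, so Context(P) = 2/5. The positive difference suggests that P being true raises the chance of failure beyond merely reaching the point where P is checked.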
Notes
• Two predicates are redundant if they predict the same or nearly the same set of failing runs
• Because elimination is iterative, it is only necessary that Importance selects a good predictor at each step, not necessarily the best one.
Guide to Visualization
[Figure: each predicate is drawn as a bar of length log(F(P) + S(P)), divided into bands for Context(P), Increase(P), the error bound, and S(P)]
http://www.cs.wisc.edu/~liblit/pldi-2005/
Rank by Increase(P)
• High Increase() but very few failing runs!
• These are all sub-bug predictors
– Each covers one special case of a larger bug
• Redundancy is clearly a problem
Rank by F(P)
• Many failing runs but low Increase()!
• Tend to be super-bug predictors
– Each covers several bugs, plus lots of junk
Notes
• In the language of information retrieval
– Increase(P) has high precision, low recall
– F(P) has high recall, low precision
• Standard solution:
– Take the harmonic mean of both
– Rewards high scores in both dimensions
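A small sketch of the harmonic-mean idea. The slide does not say how F(P) is scaled before it is combined with Increase(P); the version below assumes the normalization log F(P) / log NumF, where NumF is the total number of failing runs, so that both terms lie between 0 and 1. The function name, parameter names, and the zero-score guard are illustrative assumptions.

```python
import math


def importance(increase, f_true, num_failures):
    """Harmonic mean of Increase(P) and a log-scaled, normalized F(P)."""
    # Guard against non-positive terms, which would make the mean undefined.
    if increase <= 0 or f_true <= 1 or num_failures <= 1:
        return 0.0
    recall_like = math.log(f_true) / math.log(num_failures)
    return 2.0 / (1.0 / increase + 1.0 / recall_like)


# Three hypothetical predicates, out of 1000 failing runs in total.
sub_bug = importance(increase=0.90, f_true=2, num_failures=1000)
super_bug = importance(increase=0.05, f_true=900, num_failures=1000)
balanced = importance(increase=0.60, f_true=300, num_failures=1000)
print(sub_bug, super_bug, balanced)
```

The harmonic mean is dominated by the smaller of its two terms, so a predicate only ranks highly when both its Increase and its failure coverage are substantial; the sub-bug and super-bug examples each score well below the balanced one.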
Rank by Harmonic Mean
• It works!
– Large increase, many failures, few or no successes
• But redundancy is still a problem
Lessons Learned
• Can learn a lot from actual executions
– Users are running buggy code anyway
– We should capture some of that information
• Crash reporting is a good start, but…
– Pre-crash behavior can be important
– Successful runs reveal correct behavior
– Stack alone is not enough for 50% of bugs