Mining Function Usage Patterns to Find Bugs
Chadd Williams
2/19 University of Maryland
open(f)
tmp = cnt = 0
while (cnt < sz && tmp != -1)
    tmp = read(f, sz)
    if (tmp != -1) cnt += tmp
close(f)
Thesis
Source code is full of interesting properties
– they describe how the source code is written
– rules that one must adhere to for the code to work correctly
– what to do with values returned from a function
– how to use an API
Can we find the properties?
– every change is committed
– changes highlight misunderstood code
We can discover important properties by looking at source code changes
Can we use these rules to help the developer find bugs?
Why?
We wrote the code, we know the rules!
Implicit rules build up over time
– little or no documentation
– failure to understand implicit rules causes bugs
• 32% of bugs detected during maintenance [1]
How much do you know about your 10 year old code base?
– Didn’t someone rewrite the matrix objects?
– What about that third party library?
[1] Matsumura, T., Monden, A., Matsumoto, K., The Detection of Faulty Code Violating Implicit Coding Rules, IWPSE ’02
Static Analysis
Analysis of code without execution
– examine the source code only
Many successful static analysis tools check for violations of system specific rules
– how to use an internal API
– specialized lock/unlock functionality
– data validation requirements
Often produces many false warnings
– can historical information improve this?
General Technique
Inspect each commit to each file
Identify properties in each version
Compare sets of properties to determine new instances of properties
Identify commonly added properties
Before the commit:
…
value = foo();
newPosition += value;
…
After the commit:
…
value = foo();
if (value != error_code) {
    newPosition += value;
}
…
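The four steps above can be sketched as a set comparison between consecutive revisions. This is only an illustration under stated assumptions: each property is modeled as a plain string, the helper names `new_properties` and `commonly_added` are hypothetical, and the revision data is invented for the example.

```python
from collections import Counter

def new_properties(before, after):
    """Properties present after a commit that were absent just before it."""
    return set(after) - set(before)

def commonly_added(history):
    """Count how often each property is newly added across all commits.

    `history` maps a file name to its list of per-revision property sets,
    oldest revision first (an assumed representation).
    """
    counts = Counter()
    for revisions in history.values():
        for before, after in zip(revisions, revisions[1:]):
            counts.update(new_properties(before, after))
    return counts

# Hypothetical data: "check:foo" means "the return value of foo() is checked".
history = {
    "a.c": [set(), {"check:foo"}, {"check:foo", "check:bar"}],
    "b.c": [{"check:bar"}, {"check:bar", "check:foo"}],
}
added = commonly_added(history)
print(added.most_common())
```

Using sets ignores how many times a property occurs inside one revision; counting individual instances would need a multiset instead.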
Evaluation
Does historical information help?
– can we get the same value by only looking at the latest version of the source code?
Metric
– are the likely bugs near the top?
– cumulative precision
• Precision: number of likely bugs vs. number of warnings inspected
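As a small worked example of the metric, cumulative precision after k warnings is the number of likely bugs among the first k divided by k. The sketch below assumes the ranked warnings have already been labeled by hand; the function name and the sample labels are hypothetical.

```python
def cumulative_precision(is_likely_bug):
    """Precision after inspecting the first k warnings, for k = 1..n."""
    precisions = []
    hits = 0
    for k, hit in enumerate(is_likely_bug, start=1):
        hits += bool(hit)
        precisions.append(hits / k)
    return precisions

# Hypothetical labels for a ranked list: True means "likely bug".
ranked = [True, True, False, True]
curve = cumulative_precision(ranked)
print(curve)
```

A ranking that puts likely bugs near the top keeps this curve high for small k, which is what the metric is meant to reward.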
Return Value Check Bug
Identify functions whose return value induces a code change
Tool inferred:
…
value = foo();
newPosition += value; // ???
…
Bug fix:
…
value = foo();
if (value != error_code) { // Check
    newPosition += value;
}
…
Apache Results
Provide developers a list of sorted warnings
– use historical information for sorting
[Figure: cumulative precision (0 to 1) vs. warnings inspected (1 to 21), comparing Naive Ranking and HistoryAware Ranking]
Chi-square = 6.15, p ≤ 0.025
Discovering Function Usage Patterns
Function Usage Pattern
– describe function invocations with respect to each other
• static analysis
• intraprocedural
– describe relationships between functions
• implicit rules
mdi = HeapAlloc(GetProcessHeap());
if (!mdi) HeapFree(GetProcessHeap(), 0, cs);

HDC hdc = BeginPaint( hwnd, &ps );
if( hdc ) DrawIcon( hdc, x, y, hIcon );
EndPaint( hwnd, &ps );
Goals
Discover valid patterns
– use data mining techniques to identify patterns
Identify buggy patterns
– which patterns commonly cause a code change
Find violations of these patterns
– static analysis
– use history to rank violations
Mining Changes in Function Usage Patterns
Find new instances of patterns
– where that instance was not found in the revision immediately prior
This finds a large number of patterns
– need context to strengthen the ties between the pair of functions
– Data Flow
Before the commit:
int foo(){ open(); }
After the commit:
int foo(){ open(); read(); }
The commit creates a new instance of the pattern open() -> read()
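The open() -> read() example can be reproduced with a minimal sketch. It assumes that a pattern instance is an ordered pair of distinct functions called one before the other inside a single function body, and that the call sequence has already been extracted; `call_pairs` and `new_pattern_instances` are hypothetical helper names.

```python
from itertools import combinations

def call_pairs(calls):
    """Ordered pairs (first, second) where `first` is called before `second`.

    `calls` is the sequence of function names invoked in one function body.
    """
    return {(a, b) for a, b in combinations(calls, 2) if a != b}

def new_pattern_instances(calls_before, calls_after):
    """Pairs found in the new revision but not in the one immediately prior."""
    return call_pairs(calls_after) - call_pairs(calls_before)

before = ["open"]          # int foo(){ open(); }
after = ["open", "read"]   # int foo(){ open(); read(); }
new = new_pattern_instances(before, after)
print(new)
```

Pairing every earlier call with every later call is what makes the number of candidate patterns large, hence the need for extra context such as data flow.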
Data Flow
Identify data flow relationships between function pairs
– produce/consume
– use same data
– update same data
HDC hdc = BeginPaint( hwnd, &ps );
if( hdc ) DrawIcon( hdc, x, y, hIcon );
EndPaint( hwnd, &ps );

HDC hdc = BeginPaint( hwnd, &ps );
if( hdc ) hdc.x = genX();
Data flow confidence
– what percent of new instances of foo() -> bar() have a data flow relationship?
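A rough sketch of data flow confidence for one pattern follows. The `Call` record, the two relationship checks, and the sample instances are assumptions made for illustration; the "update same data" case is left out because it needs information about writes that this simple model does not carry.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Call:
    name: str
    args: tuple = ()              # variable names passed to the call
    result: Optional[str] = None  # variable that receives the return value

def data_flow(first, second):
    """Name the data flow relationship between two calls, or return None."""
    if first.result and first.result in second.args:
        return "produce/consume"
    if set(first.args) & set(second.args):
        return "use same data"
    return None

def data_flow_confidence(instances):
    """Fraction of new instances of one pattern with a data flow relationship."""
    if not instances:
        return 0.0
    related = sum(1 for first, second in instances if data_flow(first, second))
    return related / len(instances)

begin = Call("BeginPaint", ("hwnd", "ps"), "hdc")
draw = Call("DrawIcon", ("hdc", "x", "y", "hIcon"))
# Two hypothetical new instances of BeginPaint() -> EndPaint().
instances = [
    (begin, Call("EndPaint", ("hwnd", "ps"))),      # shares hwnd and ps
    (begin, Call("EndPaint", ("other", "other_ps"))),  # shares nothing
]
print(data_flow(begin, draw), data_flow_confidence(instances))
```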
Bug-Prone Patterns
How does a new instance enter the source code?
– both of the function calls were added
– one function call was added
• the added function completed the pairing
• bug fix? refactoring?
Bug confidence
– what percent of new instances of foo() -> bar() are created by adding one function call?
Revision 1:
int foo(){ }
Revision 2, after a commit:
int foo(){ open(); read(); }
Revision 3, after another commit:
int foo(){ open(); read(); close(); }
And which function call is most likely to be added?
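Both questions on this slide reduce to ratios over the new instances of one pattern. In the sketch below each instance is assumed to be recorded as a pair of booleans saying whether the first and the second call were added by the commit; the function names and the sample data are hypothetical.

```python
def one_call_added(instances):
    """Instances created by adding exactly one of the two calls."""
    return [pair for pair in instances if pair[0] != pair[1]]

def bug_confidence(instances):
    """Fraction of new instances created by adding exactly one function call."""
    if not instances:
        return 0.0
    return len(one_call_added(instances)) / len(instances)

def second_call_added_ratio(instances):
    """Among one-call additions, how often the added call is the second one."""
    single = one_call_added(instances)
    if not single:
        return 0.0
    return sum(1 for _, second_added in single if second_added) / len(single)

# Hypothetical new instances of open() -> close():
# (first call added?, second call added?)
instances = [
    (True, True),    # both calls added together
    (False, True),   # close() added to an existing open()
    (False, True),
    (True, False),   # open() added before an existing close()
    (True, True),
]
print(bug_confidence(instances), second_call_added_ratio(instances))
```

Here three of five instances were completed by a single added call, and in two of those three the added call was the second function.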
Valid, Bug Prone Patterns
Patterns added completely could indicate valid patterns
Patterns added by adding one function call indicate:
– refactoring/very misunderstood pattern
– random noise
Which are likely to be buggy?
[Figure: Two Function Calls Added vs. One Function Call Added]
Ranking of Violations
Number of violations for each pattern
– experience from the current code base
Data Flow Confidence
– which are valid patterns
Bug Confidence
– which have caused code changes in the past
Confidence
– how often, when foo() is added, is foo() -> bar() created
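The slides list the ranking inputs but not how they are combined, so the following is only one possible combination, stated as an assumption: patterns are ordered by the product of bug confidence and data flow confidence, and ties are broken in favor of patterns with fewer violations. The `PatternStats` record and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PatternStats:
    first: str
    second: str
    violations: int              # violations found in the current code base
    bug_confidence: float
    data_flow_confidence: float

def rank_patterns(stats):
    """Order patterns so that violations of the top ones are inspected first.

    Assumed scoring, not taken from the slides: higher product of the two
    confidences first, then fewer violations.
    """
    return sorted(
        stats,
        key=lambda p: (-(p.bug_confidence * p.data_flow_confidence), p.violations),
    )

stats = [
    PatternStats("open", "read", 3, 0.2, 0.9),
    PatternStats("open", "close", 5, 0.8, 1.0),
    PatternStats("lock", "unlock", 1, 0.8, 1.0),
]
ranked = rank_patterns(stats)
print([(p.first, p.second) for p in ranked])
```

Other combinations, such as a weighted sum or a strict lexicographic order, would fit the same inputs; the choice affects which warnings reach the top of the list.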
Preliminary Results
Student Projects
– CS 3
– Introduction to C
– CVS history for each student for each project
• CVS commit to see automated test results
– 50% precision on final submission
Apache web server
– 50% precision rate for the top 10 warnings
– identified a refactoring
Wine
TREEVIEW_ValidItem(tree,item);
TREEVIEW_SendTreeviewNotify(tree,command,item);
Apache Case Study
1,129 C source files
– includes modules
– Apache Portable Runtime
41,000 CVS commits
– 6,000 compilable CVS transactions that change source files for the Linux version
Studied httpd-2.0 branch
– July 1996 through Oct 2003
– some files have history back through 1.0 branch
Apache Refactoring
Found many patterns of this form:
Thu Nov 18 23:07:53 1999 UTC (6 years, 3 months ago)
… I then changed all the fprintf(stderr calls to ap_log_error …
Function 1            Function 2     Bug Confidence   Add Second Function
shmcb_get_safe_uint   ap_log_error   1.0              1.0
ssl_util_vhostid      ap_log_error   0.8              1.0
Bug Confidence: how often is this pattern created by adding exactly one function call
Add Second Function: how often, when one function call is added to create this pattern, is it the second function call
Change debug logging
– previously printf
– now ap_log_error or ap_log_rerror
Can we find bugs?
Static analysis to identify violations of ap_log_error patterns
– 16 of first 20 warnings are likely bugs
• first 20 warnings involving ap_log_error
– ranking based on
• violations per pattern
• bug confidence
• data flow confidence
Why do these bugs exist?
– missed refactorings
– bugs caused by not knowing implicit rules
This refactoring started in 1999
Conclusions
Interesting properties can be mined from change history
– function usage patterns
Using historical information has improved static analysis tools
– provide a list of ranked warnings to user
– reduced false positive rate