Assisting Code Search with Automatic Query Reformulation for Bug Localization
Assisting Code Search with Automatic Query
Reformulation for Bug Localization
Bunyamin Sisman & Avinash Kak
Purdue University
MSR’13
Outline
I. Motivation
II. Past Work on Automatic Query Reformulation
III. Proposed Approach: Spatial Code Proximity (SCP)
IV. Data Preparation
V. Results
VI. Conclusions
Main Motivation
Software developers routinely use arbitrary abbreviations and concatenations in identifiers, and these are generally difficult to predict when searching the code base of a project.
The question is: “Is there some way to automatically reformulate a user’s query so that all such relevant terms are also used in retrieval?”
Summary of Our Contribution
We show how a query can be automatically reformulated for superior retrieval accuracy
We propose a new framework for Query Reformulation, which leverages the spatial proximity of the terms in files
The approach leads to significant improvements over the baseline and the competing Query Reformulation approaches
Summary of Our Contribution (Cont’d)
Our approach preserves or improves the retrieval accuracy for 76% of the 4,393 bugs we analyzed for the Eclipse and Chrome projects
Our approach improves the retrieval accuracy for 42% of the 4,393 bugs
Improvements are 66% for Eclipse and 90% for Chrome in terms of MAP (Mean Average Precision)
We also describe the conditions under which Query Reformulation may perform poorly.
Query Reformulation with Relevance Feedback
1. Perform an initial retrieval with the original query
2. Analyze the set of top retrieved documents vis-à-vis the query
3. Reformulate the query
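The three steps above can be sketched as pseudo-relevance feedback over a toy corpus; the file names and the overlap-based scorer below are illustrative stand-ins, not the paper's actual retrieval model:

```python
from collections import Counter

# Toy corpus: file name -> tokenized contents. A stand-in for an indexed code base.
corpus = {
    "tab_strip.cc":  ["tab", "strip", "animation", "browser", "pinned"],
    "browser_ui.cc": ["browser", "window", "animation", "duration"],
    "net_socket.cc": ["socket", "net", "connect", "timeout"],
}

def score(query, doc_terms):
    # Simple term-overlap score standing in for a real retrieval model.
    return sum(1 for t in query if t in doc_terms)

def reformulate(query, k=2, n_expand=2):
    # 1. Perform an initial retrieval with the original query.
    ranked = sorted(corpus, key=lambda d: score(query, corpus[d]), reverse=True)
    # 2. Analyze the top-k retrieved documents (pseudo-relevance feedback).
    feedback = Counter()
    for doc in ranked[:k]:
        feedback.update(t for t in corpus[doc] if t not in query)
    # 3. Reformulate: expand the query with the most frequent feedback terms.
    return query + [t for t, _ in feedback.most_common(n_expand)]

print(reformulate(["browser", "animation"]))
```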
Acquiring Relevance Feedback
Implicitly: infer feedback from user interactions
Explicitly: user provides feedback [Gay et al. 2009]
Pseudo Relevance Feedback (PRF): Automatic QR
This is our work!
Data Flow in the Proposed Retrieval Framework
Automatic Query Reformulation
No user involvement!
It takes less than a second to reformulate a query on ordinary desktop hardware!
It is cheap!
It is effective!
Previous Work on Automatic QR (for Text Retrieval)
Rocchio’s Formula (ROCC)
Relevance Model (RM)
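As background, Rocchio's formula moves the query vector toward the centroid of the feedback documents. A minimal sketch over dictionary term vectors; the weights alpha and beta and the example vectors are illustrative defaults, not the paper's settings, and the negative-feedback term is dropped since PRF has no non-relevant set:

```python
from collections import defaultdict

def rocchio(query_vec, feedback_docs, alpha=1.0, beta=0.75):
    # q' = alpha * q + (beta / |D_f|) * sum of feedback document vectors.
    new_q = defaultdict(float)
    for term, w in query_vec.items():
        new_q[term] += alpha * w
    for doc in feedback_docs:
        for term, w in doc.items():
            new_q[term] += beta * w / len(feedback_docs)
    return dict(new_q)

q = {"browser": 1.0, "animation": 1.0}
docs = [{"browser": 0.5, "tab": 0.8}, {"animation": 0.4, "strip": 0.6}]
print(rocchio(q, docs))
```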
The Proposed Approach to QR: Spatial Code Proximity (SCP)
Spatial Code Proximity is an elegant approach to giving greater weight to terms in source code that occur in the vicinity of the terms in a user’s query
Proximities may be created through commonly used concatenations:
Punctuation characters
Underscores: tab_strip_gtk
Camel casing: kPinnedTabAnimationDurationMs
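Such concatenated identifiers can be broken into their constituent terms before indexing. A minimal splitter, assuming the usual underscore and camel-case conventions (not necessarily the paper's exact tokenizer):

```python
import re

def split_identifier(ident):
    # Split on underscores and punctuation, then on camel-case boundaries.
    parts = re.split(r"[_\W]+", ident)
    terms = []
    for p in parts:
        if p:
            # Uppercase runs, capitalized words, lowercase runs, and digits.
            terms += re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", p)
    return [t.lower() for t in terms]

print(split_identifier("tab_strip_gtk"))
print(split_identifier("kPinnedTabAnimationDurationMs"))
```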
Spatial Code Proximity (SCP) (Cont’d)
Tokenize source files and index the positions of the terms in each source file:
Use the distance between terms to find relevant terms vis-à-vis a query!
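A positional index of this kind can be sketched as follows; the file and its token list are toy data:

```python
from collections import defaultdict

def build_positional_index(files):
    # term -> file name -> sorted list of token positions in that file.
    index = defaultdict(lambda: defaultdict(list))
    for name, tokens in files.items():
        for pos, term in enumerate(tokens):
            index[term][name].append(pos)
    return index

files = {"tab_strip_gtk.cc": ["tab", "strip", "animation", "tab", "pinned"]}
idx = build_positional_index(files)
print(idx["tab"]["tab_strip_gtk.cc"])  # positions of "tab" in the file
```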
SCP: Bringing the Query into the Picture
Example Query: “Browser Animation”
First, perform an initial retrieval with the original query
Then increase the weights of the terms that occur near the query terms in the top-ranked files
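One simple way to realize this boosting, assuming an inverse-distance contribution within a small window (the paper's exact proximity function is not reproduced here), is:

```python
def proximity_weights(doc_tokens, query_terms, window=3):
    # Give each non-query term a weight that grows with how close it appears
    # to any query-term occurrence (closer -> larger contribution).
    q_positions = [i for i, t in enumerate(doc_tokens) if t in query_terms]
    weights = {}
    for i, term in enumerate(doc_tokens):
        if term in query_terms:
            continue
        for qp in q_positions:
            d = abs(i - qp)
            if 0 < d <= window:
                weights[term] = weights.get(term, 0.0) + 1.0 / d
    return weights

doc = ["browser", "animation", "duration", "ms", "socket", "timeout"]
print(proximity_weights(doc, {"browser", "animation"}))
```

Terms such as "duration", sitting right next to both query terms, receive the largest boost, while terms outside the window receive none.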
Research Questions
Question 1: Does the proposed QR approach improve the accuracy of source code retrieval? If so, to what extent?
Question 2: How do the QR techniques that are currently in the literature perform for source code retrieval?
Question 3: How does the initial retrieval performance affect the performance of QR?
Question 4: What are the conditions under which QR may perform poorly?
Data Preparation
For evaluation, we need a set of queries and the relevant files
We use the titles of the bug reports as queries
We have to link repository commits to the bug tracking database. We used regular expressions to detect bug-fix commits based on their commit messages.
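A hedged illustration of such a pattern; the paper's actual regular expressions are not given, so this pattern and the message format it matches are hypothetical:

```python
import re

# Illustrative pattern: matches messages like "Fixes bug #90112" or "Issue 12345".
BUG_RE = re.compile(
    r"\b(?:bug|fix(?:e[sd])?|issue)\b[^0-9]{0,10}#?(\d{3,7})",
    re.IGNORECASE,
)

def linked_bug_ids(commit_message):
    # Return the bug-tracker IDs referenced in a commit message, if any.
    return [int(m) for m in BUG_RE.findall(commit_message)]

print(linked_bug_ids("Fixes bug #90112: crash when closing pinned tab"))
```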
Data Preparation (Cont’d)
                        Eclipse v3.1    Chrome v4.0
#Bugs                   4,035           358
Avg. #Relevant Files    2.76            3.82
Avg. #Commits           1.36            1.23
Resulting dataset: BUGLinks (https://engineering.purdue.edu/RVL/Database/BUGLinks/)
Evaluation Framework
We use Precision and Recall based metrics to evaluate the retrieval accuracy.
Determine the query sets for which the proposed QR approaches lead to
1. improvements in the retrieval accuracy
2. degradation in the retrieval accuracy
3. no change in the retrieval accuracy
Analyze these sets to understand the characteristics of the queries each set contains
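Average Precision per query, whose mean over all queries gives the MAP figures quoted earlier, can be computed as below; the file names are illustrative:

```python
def average_precision(ranked_files, relevant):
    # Precision at each rank where a relevant file is retrieved,
    # averaged over the total number of relevant files.
    hits, total = 0, 0.0
    for rank, f in enumerate(ranked_files, start=1):
        if f in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

ranked = ["a.cc", "b.cc", "c.cc", "d.cc"]
print(average_precision(ranked, {"a.cc", "c.cc"}))  # (1/1 + 2/3) / 2
```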
Evaluation Framework (Cont’d)
For comparison of these sets, we used the following Query Performance Prediction (QPP) metrics [Haiduc et al. 2012, He et al. 2004]:
Average Inverse Document Frequency (avgIDF)
Average Inverse Collection Term Frequency (avgICTF)
Query Scope (QS)
Simplified Clarity Score (SCS)
Additionally, we analyzed
Query Lengths
Number of Relevant files per bug
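For instance, avgIDF averages the inverse document frequency of the query terms; higher values suggest rarer, more discriminative terms. A small sketch with hypothetical document frequencies:

```python
import math

def avg_idf(query_terms, doc_freq, n_docs):
    # Mean of log(N / df(t)) over the query terms.
    return sum(math.log(n_docs / doc_freq[t]) for t in query_terms) / len(query_terms)

# Hypothetical document frequencies over a 1,000-file corpus.
df = {"browser": 400, "animation": 20}
print(avg_idf(["browser", "animation"], df, 1000))
```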
QR with Bug Report Titles
[Bar chart: for each of ROCC, RM, and SCP (Proposed), the number of bugs (y-axis: #Bugs, 0–2000) whose retrieval accuracy improved vs. worsened after QR]
Improvements in Retrieval Accuracy (% Increase in MAP)
[Bar chart: % increase in MAP (y-axis: 0%–10000%) for ROCC, RM, and SCP (Proposed) on Eclipse and Chrome]
Conclusions & Future Work
Our framework can use a weak initial query as a jumping-off point for a better query.
No user input is necessary.
We obtained significant improvements over the baseline and over the well-known Automatic QR methods.
Future work includes the evaluation of different term-proximity metrics in source code for QR.
References
[1] B. Sisman and A. Kak, “Incorporating version histories in information retrieval based bug localization,” in Proceedings of the 9th Working Conference on Mining Software Repositories (MSR’12). IEEE, 2012, pp. 50–59.
[2] G. Gay, S. Haiduc, A. Marcus, and T. Menzies, “On the use of relevance feedback in IR-based concept location,” in International Conference on Software Maintenance (ICSM’09), Sept. 2009, pp. 351–360.
[3] A. Marcus, A. Sergeyev, V. Rajlich, and J. I. Maletic, “An information retrieval approach to concept location in source code,” in Proceedings of the 11th Working Conference on Reverse Engineering (WCRE’04). IEEE Computer Society, 2004, pp. 214–223.
[4] S. Haiduc, G. Bavota, R. Oliveto, A. De Lucia, and A. Marcus, “Automatic query performance assessment during the retrieval of software artifacts,” in Proceedings of the 27th International Conference on Automated Software Engineering (ASE’12). ACM, 2012, pp. 90–99.
[5] B. He and I. Ounis, “Inferring query performance using pre-retrieval predictors,” in Proc. Symposium on String Processing and Information Retrieval. Springer Verlag, 2004, pp. 43–54.