Mining Unstructured Software Repositories Using IR Models
Stephen W. Thomas
PhD Candidate
Queen’s University
Stephen W. Thomas. Mining Software Repositories with Topic Models. ICSE 2011.
Stephen W. Thomas, Hadi Hemmati, Ahmed E. Hassan, and Dorothea Blostein. Static Test Case Prioritization Using Topic Models. Empirical Software Engineering, 2012.
Stephen W. Thomas, Nicolas Bettenburg, Ahmed E. Hassan, and Dorothea Blostein. Talk and Work: Recovering the Relationship between Mailing List Discussions and Development Activity. Empirical Software Engineering, 2nd round.
Stephen W. Thomas, Meiyappan Nagappan, Ahmed E. Hassan, and Dorothea Blostein. The Impact of Classifier Configuration and Classifier Combination on Bug Localization. IEEE Transactions on Software Engineering, 2nd round.
Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein. Validating the Use of Topic Models for Software Evolution. SCAM 2010.
Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein. Modeling the Evolution of Topics in Source Code Histories. MSR 2011.
Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein. Studying Software Evolution Using Topic Models. Science of Computer Programming, 2012.
code changes
logs
bugs
email
reqs
bug prediction
traceability linking
feature location
architecture recovery
change pattern detection
00:03:45: E22344, 76, 90.3,
00:03:46: E2f3a4, 82, 95.0,
00:03:56: E22345, 78, 96.6,
00:04:15: E22344, 23, 95.1,
00:04:35: E23348, 65, 95.7,
00:04:37: E2234b, 56, 93.1,
00:04:38: E2234b, 54, 95.0,
00:04:39: E22a34, 98, 95.1,
00:05:42: E353f4, 65, 94.7,
00:05:42: E3556j, 45, 95.2,
00:05:42: E3545g, 63, 92.8,
00:05:42: E354r4, 94, 95.6,
source code comments
bug reports
emails
requirement descriptions
forum and blog posts
commit messages
source code identifiers
NPE caused by no spashscreen handler service available
Provide unittests for link creation constraints, unit tests fail in standalone build
Service
pricing
Conference
Part 1
Part 2
Part 3
The research and practice of using IR models to mine software repositories can be improved by
(i) considering additional software engineering tasks, such as prioritizing test cases;
(ii) using advanced IR techniques, such as combining multiple IR models; and
(iii) better understanding the assumptions and parameters of IR models.
Test Case Prioritization
identifiers, comments, string literals
Similarity
Less similar, higher priority
Part 1
[EMSE 2012]
structural-based, IR-based
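The "less similar, higher priority" idea can be illustrated with a minimal sketch. It assumes each test case has already been reduced to the text of its identifiers, comments, and string literals, and it swaps in plain term-frequency vectors with cosine distance rather than the topic models named in the [EMSE 2012] title. The names `tokenize`, `cosine_distance`, and `prioritize` are hypothetical and used only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a stand-in for real identifier/comment extraction."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_distance(a, b):
    """1 - cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def prioritize(tests):
    """Greedy ordering: repeatedly pick the test least similar to those already chosen."""
    vectors = {name: Counter(tokenize(text)) for name, text in tests.items()}
    remaining = sorted(vectors)

    def avg_dist(name):
        others = [m for m in remaining if m != name]
        total = sum(cosine_distance(vectors[name], vectors[m]) for m in others)
        return total / max(len(others), 1)

    # Seed with the test that is, on average, farthest from all others.
    first = max(remaining, key=avg_dist)
    order = [first]
    remaining.remove(first)
    while remaining:
        # Next: the test whose closest already-chosen neighbour is farthest away.
        nxt = max(
            remaining,
            key=lambda n: min(cosine_distance(vectors[n], vectors[m]) for m in order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

With two near-duplicate XML tests and one mail test, the mail test would be scheduled first, since it shares no terms with the others.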
Source Code and Email Interaction
cleaning and preprocessing
identifiers, comments, string literals
mail, code, XML
printing
installation
GUI
[Plot: Activity over Time; labels: Code, XML]
Monitoring project status
Software explanation
Training and documentation
Part 1
[EMSE 20XX]
Combining Multiple IR Models
identifiers, comments, string literals
Bug report: title, description
Similarity
Best individual IR model
Random subset, combined
Part 2
[TSE 20XX]
sets had improved performance
median improvement
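One way to combine the ranked lists produced by several IR models is a Borda count. The slide does not say which combination scheme was used, so this choice is an assumption, and the function name `borda_combine` and the file names in the usage note are hypothetical.

```python
from collections import defaultdict


def borda_combine(rankings):
    """Combine ranked lists of file names with a Borda count.

    Each ranking lists files from most to least relevant. A file at
    position i (0-based) in a list of n files earns n - i points;
    files absent from a list earn nothing from it.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - position
    # Sort by descending score; break ties by name so output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))
```

For three hypothetical model outputs `["Parser", "Lexer", "Mail"]`, `["Lexer", "Parser", "Mail"]`, and `["Lexer", "Mail", "Parser"]`, the scores are Lexer 8, Parser 6, Mail 4, so the combined ranking puts Lexer first.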
Concept Evolution Models
identifiers, comments, string literals
XML concept
Swing concept
Encryption concept
[Plot: Popularity over Time]
Part 2
[SCP 2012] [SCAM 2010]
accuracy of topic evolutions
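The popularity-over-time curves can be sketched under one plausible definition: the popularity of a concept in a version is the mean membership of that concept across the version's documents. The slide does not define popularity, so this definition is an assumption, as is the input format (per-document topic distributions already computed by some topic model). The name `topic_popularity` is hypothetical.

```python
def topic_popularity(versions):
    """Mean topic membership per version.

    `versions` is a list (ordered by time) of lists of per-document topic
    distributions; each distribution is a list of K non-negative weights.
    Returns one list of K averages per version.
    """
    popularity = []
    for docs in versions:
        if not docs:
            # No documents in this version: nothing to average.
            popularity.append([])
            continue
        k = len(docs[0])
        popularity.append(
            [sum(doc[t] for doc in docs) / len(docs) for t in range(k)]
        )
    return popularity
```

Plotting column `t` of the result against the version index gives one evolution curve per concept.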
Data Duplication Problem
identical
Part 3
[MSR 2011]
accuracy, sensitivity
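The duplication problem can be made concrete with a small sketch that measures how many documents are identical between two consecutive versions and keeps only the ones that changed. This is an illustrative filter, not the technique from [MSR 2011]. It assumes each version is a mapping from file path to file text, and the names `fingerprint`, `duplication_rate`, and `changed_documents` are hypothetical.

```python
import hashlib


def fingerprint(text):
    """Stable hash of a document's content."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def duplication_rate(prev_version, curr_version):
    """Fraction of documents in curr_version identical to the same path in prev_version."""
    if not curr_version:
        return 0.0
    same = sum(
        1
        for path, text in curr_version.items()
        if path in prev_version
        and fingerprint(prev_version[path]) == fingerprint(text)
    )
    return same / len(curr_version)


def changed_documents(prev_version, curr_version):
    """Keep only documents that are new or whose content changed."""
    return {
        path: text
        for path, text in curr_version.items()
        if prev_version.get(path) != text
    }
```

A high duplication rate between versions means that a model fed every version of every file sees mostly repeated documents.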
Preprocessing and Parameter Effects
Code representation: identifiers? comments? past bug reports?
Bug report representation: title? description?
Preprocessing: split identifiers? remove stop words? word stemming?
IR model parameters: term weighting? No. of topics? similarity measure? No. of iterations?
Configuration matters!
worst:
best:
mean:
Part 3
[TSE 20XX]
“configuration”
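A "configuration" here is one choice for each of the questions above. A minimal sketch of enumerating them as a full factorial grid follows. The option values in `OPTIONS` are hypothetical placeholders, not the values studied in [TSE 20XX], and `all_configurations` and `summarize` are illustrative names.

```python
from itertools import product

# Hypothetical option values, named after the questions on the slide.
OPTIONS = {
    "code_representation": ["identifiers", "comments", "identifiers+comments"],
    "bug_report_representation": ["title", "description", "title+description"],
    "split_identifiers": [False, True],
    "remove_stop_words": [False, True],
    "stemming": [False, True],
    "term_weighting": ["tf", "tf-idf"],
    "similarity": ["cosine"],
}


def all_configurations(options):
    """Yield one dict per combination of option values (full factorial)."""
    keys = list(options)
    for values in product(*(options[k] for k in keys)):
        yield dict(zip(keys, values))


def summarize(scores):
    """Worst, best, and mean of a non-empty list of per-configuration scores."""
    return {
        "worst": min(scores),
        "best": max(scores),
        "mean": sum(scores) / len(scores),
    }
```

Even this small grid yields 3 x 3 x 2 x 2 x 2 x 2 x 1 = 144 configurations, so the gap between worst and best can only be seen by evaluating many of them.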
Proposed and evaluated a technique to prioritize test cases
Proposed and evaluated a technique to analyze the interaction of source code and mailing lists
Described and evaluated a technique to analyze code histories using topic evolution models
Proposed and evaluated a framework for combining the results of disparate IR models
Overcame the data duplication problem in large source code histories
Analyzed the sensitivity of IR models to data preprocessing and IR model parameters