Mining Unstructured Software Repositories Using IR Models

Stephen W. Thomas, PhD Candidate, Queen’s University

Transcript of Mining Unstructured Software Repositories Using IR Models

Page 1: Mining Unstructured Software Repositories Using IR Models

Mining Unstructured Software Repositories Using IR Models

Stephen W. Thomas, PhD Candidate

Queen’s University

Page 2: Mining Unstructured Software Repositories Using IR Models

Stephen W. Thomas. Mining Software Repositories with Topic Models. ICSE 2011.

Stephen W. Thomas, Hadi Hemmati, Ahmed E. Hassan, and Dorothea Blostein. Static Test Case Prioritization Using Topic Models. Empirical Software Engineering, 2012.

Stephen W. Thomas, Nicolas Bettenburg, Ahmed E. Hassan, and Dorothea Blostein. Talk and Work: Recovering the Relationship between Mailing List Discussions and Development Activity. Empirical Software Engineering, 2nd round.

Stephen W. Thomas, Meiyappan Nagappan, Ahmed E. Hassan, and Dorothea Blostein. The Impact of Classifier Configuration and Classifier Combination on Bug Localization. IEEE Transactions on Software Engineering, 2nd round.

Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein. Validating the Use of Topic Models for Software Evolution. SCAM 2010.

Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein. Modeling the Evolution of Topics in Source Code Histories. MSR 2011.

Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein. Studying Software Evolution Using Topic Models. Science of Computer Programming, 2012.

Page 3: Mining Unstructured Software Repositories Using IR Models

code, changes, logs, bugs, email, reqs

bug prediction

traceability linking

feature location

architecture recovery

change pattern detection

Page 4: Mining Unstructured Software Repositories Using IR Models

00:03:45: E22344, 76, 90.3,
00:03:46: E2f3a4, 82, 95.0,
00:03:56: E22345, 78, 96.6,
00:04:15: E22344, 23, 95.1,
00:04:35: E23348, 65, 95.7,
00:04:37: E2234b, 56, 93.1,
00:04:38: E2234b, 54, 95.0,
00:04:39: E22a34, 98, 95.1,
00:05:42: E353f4, 65, 94.7,
00:05:42: E3556j, 45, 95.2,
00:05:42: E3545g, 63, 92.8,
00:05:42: E354r4, 94, 95.6,

source code comments

bug reports

emails

requirement descriptions

forum and blog posts

commit messages

source code identifiers

Page 5: Mining Unstructured Software Repositories Using IR Models

NPE caused by no spashscreen handler service available

Provide unittests for link creation constraints, unit tests fail in standalone build

Page 6: Mining Unstructured Software Repositories Using IR Models

[Figure: labels "Service", "pricing", and "Conference"]

Page 7: Mining Unstructured Software Repositories Using IR Models


Page 8: Mining Unstructured Software Repositories Using IR Models

New!

Part 1

Part 2

Part 3

Page 9: Mining Unstructured Software Repositories Using IR Models


The research and practice of using IR models to mine software repositories can be improved by

(i) considering additional software engineering tasks, such as prioritizing test cases;

(ii) using advanced IR techniques, such as combining multiple IR models; and

(iii) better understanding the assumptions and parameters of IR models.

Page 10: Mining Unstructured Software Repositories Using IR Models

Test Case Prioritization

Part 1

[EMSE 2012]

identifiers, comments, string literals

Similarity

Less similar, higher priority

structural-based, IR-based
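The "less similar, higher priority" idea on this slide can be sketched as a greedy ordering. This is a minimal sketch under stated assumptions, not the technique evaluated in [EMSE 2012]: it swaps in plain term-frequency vectors with cosine similarity where the cited work uses topic models, and the function names (`tokenize`, `cosine_similarity`, `prioritize`) and the sample test texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text of a test case and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prioritize(tests):
    """Order test names so each next test is the least similar to those already chosen."""
    vectors = {name: Counter(tokenize(text)) for name, text in tests.items()}
    remaining = sorted(vectors)      # deterministic tie-breaking by name
    order = [remaining.pop(0)]       # assumption: seed with the first name
    while remaining:
        def closeness(name):
            # Highest similarity between this candidate and any test already ordered.
            return max(cosine_similarity(vectors[name], vectors[done]) for done in order)

        nxt = min(remaining, key=closeness)
        remaining.remove(nxt)
        order.append(nxt)
    return order


# Hypothetical test cases, represented by their identifiers, comments, and string literals.
tests = {
    "test_parse_xml": "parse xml document // check xml root element",
    "test_parse_xml_attrs": "parse xml document // check xml attribute values",
    "test_print_page": "print page to printer // check page count",
}
print(prioritize(tests))
# ['test_parse_xml', 'test_print_page', 'test_parse_xml_attrs']
```

The printing test is scheduled before the second XML test because it shares almost no terms with the test already chosen.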

Page 11: Mining Unstructured Software Repositories Using IR Models

Source code Email Interaction

Part 1

[EMSE 20XX]

cleaning and preprocessing

identifiers, comments, string literals

mail, code

XML, printing, installation, GUI

[Figure: Activity over Time for the XML topic; series: Code, Mail]

Monitoring project status

Software explanation

Training and documentation

Page 12: Mining Unstructured Software Repositories Using IR Models

New!

Part 1

Part 2

Part 3

Page 13: Mining Unstructured Software Repositories Using IR Models

Combining Multiple IR Models

Part 2

[TSE 20XX]

identifiers, comments, string literals

Bug report: title, description

Similarity

Best individual IR model

Random subset, combined

sets had improved performance

median improvement
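One way to picture combining the results of several IR models is rank aggregation. The sketch below uses a Borda count, which is an assumption made for illustration: the slide does not name the combination method. The function name `borda_combine` and the file names are hypothetical.

```python
from collections import defaultdict


def borda_combine(rankings):
    """Merge several ranked lists of file names into one ranking using a Borda count."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            # The top-ranked item earns n points, the next n - 1, and so on.
            scores[item] += n - position
    # Highest total first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical rankings from three IR models for a single bug report.
rankings = [
    ["Parser.java", "Lexer.java", "Printer.java"],
    ["Lexer.java", "Parser.java", "Printer.java"],
    ["Parser.java", "Printer.java", "Lexer.java"],
]
combined = borda_combine(rankings)
print(combined)
# ['Parser.java', 'Lexer.java', 'Printer.java']
```

Here `Parser.java` scores 8, `Lexer.java` 6, and `Printer.java` 4, so a file that most models rank highly rises to the top even when one model disagrees.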

Page 14: Mining Unstructured Software Repositories Using IR Models

Concept Evolution Models

Part 2

[SCP 2012] [SCAM 2010]

identifiers, comments, string literals

[Figure: Popularity over Time; series: XML concept, Swing concept, Encryption concept]

accuracy of topic evolutions

Page 15: Mining Unstructured Software Repositories Using IR Models

New!

Part 1

Part 2

Part 3

Page 16: Mining Unstructured Software Repositories Using IR Models

Data Duplication Problem

Part 3

[MSR 2011]

identical

accuracy, sensitivity

Page 17: Mining Unstructured Software Repositories Using IR Models

Preprocessing and Parameter Effects

Part 3

[TSE 20XX]

Code representation: identifiers? comments? past bug reports?

Bug report representation: title? description?

Preprocessing: split identifiers? remove stop words? word stemming?

IR model parameters: term weighting? No. of topics? similarity measure? No. of iterations?

Configuration matters!

worst:

best:

mean:

“configuration”
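To make "configuration" concrete, the sketch below enumerates a few of the choices listed on this slide and applies two of the preprocessing steps. It is an illustrative sketch only: the option names, the tiny stop-word list, and the helper functions are hypothetical, and stemming is left out because the Python standard library has no stemmer.

```python
import itertools
import re

# Assumption: a tiny illustrative stop-word list, not the one used in the study.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "and"}


def split_identifier(token):
    """Split camelCase and snake_case tokens into their parts."""
    parts = []
    for chunk in token.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return parts


def preprocess(text, split_identifiers, remove_stop_words):
    """Tokenize text and apply the preprocessing steps selected by the flags."""
    tokens = re.findall(r"[A-Za-z0-9_]+", text)
    if split_identifiers:
        tokens = [part for token in tokens for part in split_identifier(token)]
    tokens = [t.lower() for t in tokens]
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


# Each combination of choices is one configuration.
OPTIONS = {
    "split_identifiers": [True, False],
    "remove_stop_words": [True, False],
    "bug_report_fields": ["title", "description", "title+description"],
}
configurations = [
    dict(zip(OPTIONS, values)) for values in itertools.product(*OPTIONS.values())
]
print(len(configurations))  # 12

text = "Parse the XMLFile in readConfig"
print(preprocess(text, split_identifiers=True, remove_stop_words=True))
# ['parse', 'xml', 'file', 'read', 'config']
print(preprocess(text, split_identifiers=False, remove_stop_words=True))
# ['parse', 'xmlfile', 'readconfig']
```

The same text yields different term sets under different flags, which is why the choice of configuration can change what an IR model matches.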

Page 18: Mining Unstructured Software Repositories Using IR Models

Part 1

Proposed and evaluated a technique to prioritize test cases

Proposed and evaluated a technique to analyze the interaction of source code and mailing lists

Part 2

Described and evaluated a technique to analyze code histories using topic evolution models

Proposed and evaluated a framework for combining the results of disparate IR models

Part 3

Overcame the data duplication problem in large source code histories

Analyzed the sensitivity of IR models to data preprocessing and IR model parameters