130614 sebastiano panichella - mining source code descriptions from developers communications

Keywords: software mining, source code, developers, e-mails

Page 1: 130614   sebastiano panichella -  mining source code descriptions from developers communications

Mining Source Code Descriptions from Developer Communications

Sebastiano Panichella, Jairo Aponte, Massimiliano Di Penta, Andrian Marcus, Gerardo Canfora

Page 4: 130614   sebastiano panichella -  mining source code descriptions from developers communications

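Context: Software Project

[Diagram: Developer, Source Code, and Documentation (class diagram, sequence diagram). Arrows: Documentation describes Source Code; Developer understanding of Source Code and Documentation. Program Comprehension is labeled Difficult and linked to Maintenance Tasks.]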

Page 5: 130614   sebastiano panichella -  mining source code descriptions from developers communications

Coming back to the reality...

Context: Software Project

[Diagram: Developer and Source Code only, with no Documentation. Understanding is labeled Difficult; Program Comprehension is linked to Maintenance Tasks.]

Page 7: 130614   sebastiano panichella -  mining source code descriptions from developers communications

In such situations developers need to infer knowledge from:
• the source code itself
• source code descriptions in external artifacts

Idea: We argue that messages exchanged among contributors/developers are a useful source of information to help understand source code.

Developer Discussion:

"When call the method IndexSplitter.split(File destDir, String[] segs) from the Lucene cotrib directory(contrib/misc/src/java/org/apache/lucene/index) it creates an index with segments descriptor file with wrong data. Namely wrong is the number representing the name of segment that would be created next in this index."

CLASS: IndexSplitter
METHOD: split

Page 8: 130614   sebastiano panichella -  mining source code descriptions from developers communications

A Five-Step Approach for Mining Method Descriptions

Page 10: 130614   sebastiano panichella -  mining source code descriptions from developers communications

Step 1: Downloading emails/bug reports and tracing them onto classes

Two heuristics:
• The discussion contains a fully-qualified class name (e.g., org.apache.lucene.analysis.MappingCharFilter), or the email contains a file name (e.g., MappingCharFilter.java).
• For bug reports, we complement the above heuristic by matching the bug ID of each closed bug against the commit notes, thereby tracing the bug report to the files changed in that commit.

Developer Discussion:

When call the method IndexSplitter.split(File destDir, String[] segs) from the Lucene cotrib directory (contrib/misc/src/java/org/apache/lucene/index) it creates an index with segments descriptor file with wrong data. Namely wrong is the number representing the name of segment that would be created next in this index.

    public void split(File destDir, String[] segs) throws IOException {
    destDir.mkdirs();
    FSDirectory destFSDir = FSDirectory.open(destDir);
    SegmentInfos destInfos = new SegmentInfos }

If some of the segments of the index already has this name this results either to impossibility to create new segment or in crating of an corrupted segment.

The discussion is traced to CLASS: IndexSplitter
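The two Step 1 heuristics can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the regular expressions, the function names `trace_to_classes` and `trace_bug_to_files`, the `(message, changed_files)` commit representation, and the bug IDs in the usage note are all assumptions made for the example.

```python
import re

# Assumption: a fully-qualified Java class name is a dotted lowercase package
# path ending in a capitalized identifier (e.g. org.apache.lucene.analysis.MappingCharFilter).
FQN_RE = re.compile(r"\b(?:[a-z_][a-z0-9_]*\.)+([A-Z][A-Za-z0-9_]*)\b")
# Assumption: a source file name is a capitalized identifier followed by ".java".
FILE_RE = re.compile(r"\b([A-Z][A-Za-z0-9_]*)\.java\b")


def trace_to_classes(text, known_classes):
    """Return the known class names mentioned in text by FQN or by file name."""
    mentioned = set(FQN_RE.findall(text)) | set(FILE_RE.findall(text))
    return mentioned & set(known_classes)


def trace_bug_to_files(bug_id, commits):
    """Return the files changed by commits whose note mentions bug_id.

    commits is a list of (message, changed_files) pairs (hypothetical format).
    """
    # The negative lookahead keeps "LUCENE-12" from matching "LUCENE-123".
    pattern = re.compile(re.escape(bug_id) + r"(?!\d)")
    files = set()
    for message, changed_files in commits:
        if pattern.search(message):
            files.update(changed_files)
    return files
```

For instance, `trace_to_classes("see MappingCharFilter.java", {"MappingCharFilter"})` yields the single class `MappingCharFilter`, while a text that mentions neither a qualified name nor a file name yields an empty set.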

Page 11: 130614   sebastiano panichella -  mining source code descriptions from developers communications

Step 2: Extracting paragraphs

Two heuristics:
• We consider as paragraphs the text sections separated by one or more blank lines.
• We prune source code fragments and/or stack traces out of the paragraphs, using an approach inspired by the work of Bacchelli et al.

Developer Discussion:

PAR1:
When call the method IndexSplitter.split(File destDir, String[] segs) from the Lucene cotrib directory (contrib/misc/src/java/org/apache/lucene/index) it creates an index with segments descriptor file with wrong data. Namely wrong is the number representing the name of segment that would be created next in this index.

PAR2:
    public void split(File destDir, String[] segs) throws IOException {
    destDir.mkdirs();
    FSDirectory destFSDir = FSDirectory.open(destDir);
    SegmentInfos destInfos = new SegmentInfos }

PAR3:
If some of the segments of the index already has this name this results either to impossibility to create new segment or in crating of an corrupted segment.
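A rough sketch of Step 2 is given below. Splitting on blank lines follows the slide directly; the code and stack-trace detector is only a crude stand-in chosen for illustration (lines ending in `;`, `{` or `}`, or lines shaped like `at pkg.Class.method(...)`), and is not the technique of Bacchelli et al. The function names and the 0.5 threshold are assumptions.

```python
import re

# Assumption: a line "looks like code" if it ends with ';', '{' or '}',
# or if it has the shape of a Java stack-trace frame ("at a.b.C.m(File.java:1)").
CODE_LINE_RE = re.compile(r"[;{}]\s*$|^\s*at\s+[\w.$]+\(.*\)\s*$")


def split_paragraphs(text):
    """Split text into paragraphs separated by one or more blank lines."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [p.strip() for p in parts if p.strip()]


def looks_like_code(paragraph, threshold=0.5):
    """True if at least `threshold` of the non-empty lines look like code."""
    lines = [ln for ln in paragraph.splitlines() if ln.strip()]
    code_lines = sum(1 for ln in lines if CODE_LINE_RE.search(ln))
    return code_lines / len(lines) >= threshold


def extract_paragraphs(text):
    """Return the natural-language paragraphs, pruning code and stack traces."""
    return [p for p in split_paragraphs(text) if not looks_like_code(p)]
```

Applied to the discussion above, this would keep PAR1 and PAR3 and drop PAR2, since every line of PAR2 ends in `{`, `;` or `}`.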

Page 13: 130614   sebastiano panichella -  mining source code descriptions from developers communications

Step 3: Tracing paragraphs onto methods

These paragraphs must satisfy the following two conditions:
A) A valid paragraph must contain the keyword “method”.
B) The method name must be followed by an open parenthesis, i.e., we match “foo(”.

Developer Discussion, PAR1:

When call the method IndexSplitter.split(File destDir, String[] segs) from the Lucene cotrib directory it creates an index with segments descriptor file with wrong data. Namely wrong is the number representing the name of segment that would be created next in this index.

PAR1 satisfies both A) (“method”) and B) (“split(”), so it is traced to:
CLASS: IndexSplitter
METHOD: split(
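Conditions A) and B) amount to two pattern checks, sketched below. The function name `traces_to_method` is hypothetical, and matching “method” as a whole word, case-insensitively, is an assumption; the slides do not say how the keyword is matched.

```python
import re


def traces_to_method(paragraph, method_name):
    """Check conditions A) and B) for one paragraph and one method name."""
    # A) the paragraph contains the keyword "method" (assumed: whole word, any case)
    has_keyword = re.search(r"\bmethod\b", paragraph, re.IGNORECASE) is not None
    # B) the method name is immediately followed by an open parenthesis: "foo("
    has_call_syntax = re.search(r"\b" + re.escape(method_name) + r"\(", paragraph) is not None
    return has_keyword and has_call_syntax
```

With PAR1 and the method name `split`, both checks succeed. A paragraph that mentions `split(` without the word “method”, or the word “method” without `split(`, is rejected.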

Page 17: 130614   sebastiano panichella -  mining source code descriptions from developers communications

Step 4: Heuristic-based Filtering

We defined a set of heuristics that assign each paragraph a score, to further filter the paragraphs associated with methods:

a) Method parameters: s1 = fraction of the method's parameters mentioned in the paragraph, a value between 0 and 1 (1 if the method does not have parameters).
b) Syntactic descriptions (mentioning return values): s2 = 1 if the paragraph contains the keyword “return”, 0 otherwise (1 if the method is void).
c) Overriding/Overloading: s3 = 1 if any of the “overload” or “override” keywords appears in the paragraph, 0 otherwise.
d) Method invocations: s4 = 1 if any of the “call” or “execute” keywords appears in the paragraph, 0 otherwise.

Example paragraph:

"Problem seems to come from MainMethodeSearchEngine in org.eclipse.jdt.internal.ui.launcher The Method searchMainMethods ,there's a call to addSubTypes(List, IProgressMonitor, IJavaSearchScope) Method if includesSubtypes flag is ON. This method add all types sub-types as soon as the given scope encloses them without testing if sub-types have a main! After return IType[] before the excecution"

CLASS: MainMethodSearchEngine
METHOD: searchMainMethods(IProgressMonitor, IJavaSearchScope, boolean)

% parameters = 100% -> s1 = 1
SCORE = s2 + s3 + s4 = 1 (“return”) + 0 + 1 (“call”, “excecution”) = 2

Page 18: 130614   sebastiano panichella -  mining source code descriptions from developers communications

Step 4: Heuristic-based Filtering

We selected paragraphs that satisfy both conditions:
1. s1 ≥ thP = 0.5
2. s2 + s3 + s4 ≥ thH = 1

For the example paragraph:
% parameters = 100% -> s1 = 1 ≥ 0.5
SCORE = 1 + 0 + 1 = 2 ≥ 1

Both conditions hold, so the paragraph is selected (OK).
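The scoring and selection rules of Step 4 can be sketched as below. Several details are assumptions made for the example, since the slides do not specify them: parameters are matched by name as whole words, keywords are matched as word prefixes (so “returns”, “called”, “overriding”, “execution” count), and the function names are hypothetical. The thresholds thP = 0.5 and thH = 1 are the ones stated above.

```python
import re


def score_paragraph(paragraph, param_names, is_void):
    """Return (s1, s2 + s3 + s4) for a paragraph traced to a method."""
    text = paragraph.lower()
    # a) fraction of parameters mentioned; 1 if the method has no parameters
    if param_names:
        mentioned = sum(
            1 for p in param_names
            if re.search(r"\b" + re.escape(p.lower()) + r"\b", text)
        )
        s1 = mentioned / len(param_names)
    else:
        s1 = 1.0
    # b) mentions a return value; 1 if the method is void
    s2 = 1 if is_void or re.search(r"\breturn", text) else 0
    # c) mentions overloading or overriding
    s3 = 1 if re.search(r"\b(?:overload|overrid)", text) else 0
    # d) mentions a method invocation
    s4 = 1 if re.search(r"\b(?:call|execut)", text) else 0
    return s1, s2 + s3 + s4


def is_selected(paragraph, param_names, is_void, th_p=0.5, th_h=1):
    """Keep the paragraph only if s1 >= thP and s2 + s3 + s4 >= thH."""
    s1, heuristic_score = score_paragraph(paragraph, param_names, is_void)
    return s1 >= th_p and heuristic_score >= th_h
```

Keeping s1 separate from the sum s2 + s3 + s4 mirrors the two selection conditions, which are thresholded independently.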

Page 20: 130614   sebastiano panichella -  mining source code descriptions from developers communications

Step 5: Similarity-based Filtering

We rank the filtered paragraphs by their textual similarity with the method they are likely describing.

Removing:
- English stop words
- Programming language keywords
Using:
- Camel Case splitting on the remaining words
- Vector Space Model

METHOD     PARAGRAPH     SCORE   Similarity
Method_3   Paragraph_4   2.5     96.1%
Method_1   Paragraph_1   2.5     95.6%
Method_2   Paragraph_2   1.5     97.4%
Method_3   Paragraph_3   1.5     86.2%
Method_1   Paragraph_3   1.5     79.0%
Method_3   Paragraph_2   1.5     77.5%
Method_2   Paragraph_4   1.5     64.3%
Method_2   Paragraph_3   1.3     83.2%
Method_3   Paragraph_1   1.3     73.9%
Method_2   Paragraph_1   1.3     68.7%
Method_1   Paragraph_4   1.3     53.6%

Similarity threshold: th ≥ 0.80
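A minimal sketch of Step 5 follows, assuming raw term-frequency vectors and cosine similarity as the Vector Space Model. The slides do not state the term weighting, so that choice is an assumption, as are the abbreviated stop-word and Java-keyword lists and the function names.

```python
import math
import re
from collections import Counter

# Assumption: tiny illustrative lists; a real setup would use complete ones.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "it",
              "that", "this", "with", "from", "for"}
JAVA_KEYWORDS = {"public", "private", "protected", "static", "final", "void",
                 "int", "boolean", "class", "new", "return", "throws", "if", "else"}


def tokenize(text):
    """Drop stop words and keywords, then split camelCase and lowercase."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", text):
        if word.lower() in STOP_WORDS or word in JAVA_KEYWORDS:
            continue
        # "destDir" -> dest, Dir; "FSDirectory" -> FS, Directory
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", word)
        tokens.extend(part.lower() for part in parts)
    return tokens


def cosine_similarity(a_tokens, b_tokens):
    """Cosine of the angle between two term-frequency vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_paragraphs(method_text, paragraphs, threshold=0.80):
    """Return (similarity, paragraph) pairs at or above threshold, best first."""
    method_tokens = tokenize(method_text)
    scored = [(cosine_similarity(method_tokens, tokenize(p)), p) for p in paragraphs]
    return sorted((s for s in scored if s[0] >= threshold), key=lambda s: -s[0])
```

Stop words and keywords are filtered before camelCase splitting, matching the order on the slide (splitting is applied to the remaining words).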

Page 21: 130614   sebastiano panichella -  mining source code descriptions from developers communications

Empirical Study

• Goal: analyze source code descriptions in developer discussions
• Purpose: investigate how developer discussions describe methods of Java source code
• Quality focus: finding good method descriptions in messages exchanged among contributors/developers
• Context: bug reports and mailing lists of two Java projects, Apache Lucene and Eclipse

Page 22: 130614   sebastiano panichella -  mining source code descriptions from developers communications

Context

Page 23: 130614   sebastiano panichella -  mining source code descriptions from developers communications

Research Questions

RQ1 (method coverage): How many methods from the analyzed software systems are described by the paragraphs identified by the proposed approach?

RQ2 (precision): How precise is the proposed approach in identifying method descriptions?

RQ3 (missing descriptions): How many potentially good method descriptions are missed by the approach?

Page 24: 130614   sebastiano panichella -  mining source code descriptions from developers communications

RQ1: How many methods from the analyzed software systems are described by the paragraphs identified by the proposed approach?

Page 27: 130614   sebastiano panichella -  mining source code descriptions from developers communications

RQ2: How precise is the proposed approach in identifying method descriptions?

We sampled 250 descriptions from each project

Page 30: 130614   sebastiano panichella -  mining source code descriptions from developers communications

RQ3: How many potentially good method descriptions are missed by the approach?

We sampled 100 descriptions from each project.

TABLE III: The analysis of a sample of 100 paragraphs traced to methods, but not satisfying the Step 4 heuristic

System          True Negatives   False Negatives
Eclipse         78               22
Apache Lucene   67               33

Page 31: 130614   sebastiano panichella -  mining source code descriptions from developers communications

Conclusion
