ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Post on 02-Jul-2015

363 views 1 download

description

Information Retrieval methods have been largely adopted to identify traceability links based on the textual similarity of software artifacts. However, noise due to word usage in software artifacts might negatively affect the recovery accuracy. We propose the use of smoothing filters to reduce the effect of noise in software artifacts and improve the performances of traceability recovery methods. An empirical evaluation performed on two repositories indicates that the usage of a smoothing filter is able to significantly improve the performances of Vector Space Model and Latent Semantic Indexing. Such a result suggests that other than being used for traceability recovery the proposed filter can be used to improve performances of various other software engineering approaches based on textual analysis.

Transcript of ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Improving IR-based Traceability Recovery Using Smoothing Filters

Andrea Massimiliano Rocco Annibale SebastianoDe Lucia Di Penta Oliveto Panichella Panichella

Software traceability

“The degree to which a relationship can be established between two products of a software development process” [IEEE Glossary for Software Terminology]

Important for: program comprehension requirement tracing impact analysis software reuse …

Up-to-date traceability links rarely exists → need to recover them

Up-to-date traceability links rarely exists → need to recover them

Use case Sourcecode Test case

Use case Sourcecode Test case

IR-based traceability recovery

Antoniol et al., 2002 (VSM+Probabilistic model) Marcus and Maletic, 2003 (LSI)

Antoniol et al., 2002 (VSM+Probabilistic model) Marcus and Maletic, 2003 (LSI)

Traditional IR vs.IR applied to Software Engineering

Traditional IR Deals with

heterogeneous documents for what concerns: Linguistic choices Syntax Semantics

We just live with that differences

IR applied to SE We have sets of

homogeneous documents for what concerns Syntax, linguistic

choices Examples:

Use cases, test documents, design documents follow a common template and contain recurrent words

Test case Change the date for a visit:

C51 Version: 0 02 000

Use case Satisfies the request to modify a visit for a patient

UcModVis

Priority High

....

Test description

Input Select a visit:

26/09/2003 11:00 First visit

Change: 03/10/2003 11:00

Oracle Invalid sequence: The system does not allow to change a booking

Coverage Valid classes: CE1 CE8 CE14 CE19 CE21

Invalid classes: None

Test case Change the date for a visit:

C51 Version: 0 02 000

Use case Satisfies the request to modify a visit for a patient

UcModVis

Priority High

....

Test description

Input Select a visit:

26/09/2003 11:00 First visit

Change: 03/10/2003 11:00

Oracle Invalid sequence: The system does not allow to change a booking

Coverage Valid classes: CE1 CE8 CE14 CE19 CE21

Invalid classes: None

Problem

Different kinds of software artifacts require specific preprocessing

Test case Change the date for a visit:

C51 Version: 0 02 000

Use case Satisfies the request to modify a visit for a patient

UcModVis

Priority High

....

Test description

Input Select a visit:

26/09/2003 11:00 First visit

Change: 03/10/2003 11:00

Oracle Invalid sequence: The system does not allow to change a booking

Coverage Valid classes: CE1 CE8 CE14 CE19 CE21

Invalid classes: None

Test case Change the date for a visit:

C51 Version: 0 02 000

Use case Satisfies the request to modify a visit for a patient

UcModVis

Priority High

....

Test description

Input Select a visit:

26/09/2003 11:00 First visit

Change: 03/10/2003 11:00

Oracle Invalid sequence: The system does not allow to change a booking

Coverage Valid classes: CE1 CE8 CE14 CE19 CE21

Invalid classes: None

Problem

Different kinds of software artifacts require specific preprocessing

Artifact-specif ic words do not bring useful

information

A similar problem: image processing

Pixels with peaks of low color intensity

Noise

Noisy images

Pixels with peaks of high color intensity

Mean filter

Reducing noise using smoothing filters

∑∈

=Smnf

mnfM

yxg),(

),(1

),(

Image vs. traceability noise

Image noise: Pixels with high or

low color intensity Pixels are position

dependent

Traceability noise: Terms and linguistic

patterns occurring in many artifacts of a given category Use cases, test

cases.. Artifacts (columns) are

position independentd1 d2 d2 d1

1 2 3 1 2 3

1,1 1,2 1,3 1,1

2,1 2,2 2,3 2,2

,1 ,2 ,3

s s s s t t t t

k z

k

n n nn

v v v vword

v v v vword

v v vword

L L

L

L

M O M O MM O

L L L

1,1 1,2 1,3 1,

2,1 2,2 2,3 2,

, ,1 ,2 ,3 ,

z

k z

n k n n n n z

v v v v

v v v v

v v v v v

L

L

M M O M O M MO

L L L

Source Documents Target Documents

Linguistic information strictly belonging to source documents

Linguistic information strictly belonging to target documents

Common Information for Source Documents

Common Information For target documents

Representing the noise

1 2 3 1 2 3

1,1 1,2 1,3 1,1

2,1 2,2 2,3 2,2

,1 ,2 ,3

s s s s t t t t

k z

k

n n nn

v v v vword

v v v vword

v v vword

L L

L

L

M O M O MM O

L L L

1,1 1,2 1,3 1,

2,1 2,2 2,3 2,

, ,1 ,2 ,3 ,

z

k z

n k n n n n z

v v v v

v v v v

v v v v v

L

L

M M O M O M MO

L L L

Mean source vector Mean target vector

1,1

2,1

,1

1

1

1

k

jj

k

jj

k

n jj

vk

vkS

vk

=

=

=

=

∑M

1,1

2,1

,1

1

1

1

m

jj k

m

jj k

m

n jj k

vz

vzT

vz

= +

= +

= +

=

∑M

Representing the noise

Source Documents Target Documents

Common Information for Source Documents

Common Information For target documents

The Mean Vectors are like the continuous component of a signal…The Mean Vectors are like the continuous component of a signal…

S= T=

1 2 3 1 2 3

1,1 1,2 1,3 1,1

2,1 2,2 2,3 2,2

,1 ,2 ,3

s s s s t t t t

k z

k

n n nn

v v v vword

v v v vword

v v vword

L L

L

L

M O M O MM O

L L L

1,1 1,2 1,3 1,

2,1 2,2 2,3 2,

, ,1 ,2 ,3 ,

z

k z

n k n n n n z

v v v v

v v v v

v v v v v

L

L

M M O M O M MO

L L L

S(mean target

vector)

T(mean target vector)

Representing the noise

Source Documents Target Documents

FilteredSource SetFiltered

Source Set

-

FilteredTarget SetFiltered

Target Set

-

Empirical Study

Goal: analyze the effect of smoothing filter Purpose: investigating how the filter affects

traceability recovery Quality focus: traceability recovery performance

(precision and recall) Perspective:

Researchers: evaluating the novel technique Project managers: adopt a better traceability recovery

technique Context: artifacts from two systems

EasyClinic and Pine

Context

EasyClinic Pine

Description Medical doctor office management

Text-based email client

Language Java CFiles/Classes

37 31

KLOC 20 130Documents 113 100Language Italian EnglishArtifacts Use cases

Interaction diagramsSource codeTest cases

RequirementsUse cases

Research Questions and Factors

RQ1: Does the smoothing filter improve the recovery performances of VSM-based traceability recovery?

RQ2: Does the smoothing filter improve the recovery performances of LSI-based traceability recovery?

RQ3: How do the performances vary for different types of artifacts?

Factors: Use of filter: YES, NO Technique: VSM, LSI Artifact: Req., UC, Int. Diagrams, Code, TC System: Easyclinic, Pine

Analysis Method

We statistically compare the # of false positives of different methods for each correct link identified Wilcoxon Rank Sum test Cliff’s delta effect size

correct

retrievedcorrectrecall

∩=

retrieved

retrievedcorrectprecision

∩=

Performances evaluated by precision and recall:

2

2

0

3

M1 M2

Results

EasyClinic: Use cases into source (VSM)

Recall

Prec

isio

n

Filtered

Not Filtered

EasyClinic: Use cases into source (LSI)

Filtered

Not Filtered

Prec

isio

n

Recall

EasyClinic: Test cases into source (LSI)

Filtered Not Filtered

Prec

isio

n

Recall

Test cases are: Short documents Limited vocabulary Mostly consistent with

source code

Pine: Use cases into requirements (LSI)

Filtered

Not Filtered

Prec

isio

n

Recall

Statistical Comparison

Data set TracedArtifacts

VSM LSI

p-value Effect size p-value Effect size

EasyClinic UC→Code <0.01 0.50(large)

<0.01 0.50(large)

Int. Diag. → Code

<0.01 0.52(large)

<0.01 0.34(medium)

Pine TC → Code 1.00 -(negligible)

1.00 -(negligible)

Req. → UC <0.01 0.58(large)

<0.01 0.58(large)

Link precision improvement

Login Patientvs. PersonPoor vocabulary overlap (10%)

Login Patientvs. PersonPoor vocabulary overlap (10%)

Threats to validity

Construct validity Mainly related to our oracle Provided by developers and for EasyClinic also peer-

reviewed Internal validity

Improvements could be due to other reasons… However we compared different techniques (VSM, LSI) The approach works well regardless of stop word

removal/stemming and use of tf-idf Conclusion validity

Conclusions based on proper (non-parametric) statistics External validity

We considered systems with different characteristics and artifacts

… but further studies are desirable

Conclusions

We proposed the use of smoothing filter to improve performances of IR-based traceability recovery

Idea inspired from digital signal processing The filter significantly improves IR-based

traceability recovery based on VSM (RQ1) and LSI (RQ2)

Filter particularly suitable for artifacts having a higher verbosity (RQ3) e.g., requirements and use cases

Less useful for artifacts composed of short sentences and using a limited vocabulary e.g., test cases

Work-in-progress

Study replication Different systems and artifacts Use of relevance feedback

More sophisticated smoothing technique Non linear filters

Use in other applications of IR to software engineering E.g. impact analysis or feature location

That’s all folks!