ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

28
Improving IR-based Traceability Recovery Using Smoothing Filters Andrea Massimiliano Rocco Annibale Sebastiano De Lucia Di Penta Oliveto Panichella Panichella

description

Information Retrieval methods have been largely adopted to identify traceability links based on the textual similarity of software artifacts. However, noise due to word usage in software artifacts might negatively affect the recovery accuracy. We propose the use of smoothing filters to reduce the effect of noise in software artifacts and improve the performances of traceability recovery methods. An empirical evaluation performed on two repositories indicates that the usage of a smoothing filter is able to significantly improve the performances of Vector Space Model and Latent Semantic Indexing. Such a result suggests that other than being used for traceability recovery the proposed filter can be used to improve performances of various other software engineering approaches based on textual analysis.

Transcript of ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Page 1: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Improving IR-based Traceability Recovery Using Smoothing Filters

Andrea Massimiliano Rocco Annibale SebastianoDe Lucia Di Penta Oliveto Panichella Panichella

Page 2: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Software traceability

“The degree to which a relationship can be established between two products of a software development process” [IEEE Glossary for Software Terminology]

Important for: program comprehension requirement tracing impact analysis software reuse …

Up-to-date traceability links rarely exists → need to recover them

Up-to-date traceability links rarely exists → need to recover them

Use case Sourcecode Test case

Use case Sourcecode Test case

Page 3: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

IR-based traceability recovery

Antoniol et al., 2002 (VSM+Probabilistic model) Marcus and Maletic, 2003 (LSI)

Antoniol et al., 2002 (VSM+Probabilistic model) Marcus and Maletic, 2003 (LSI)

Page 4: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Traditional IR vs.IR applied to Software Engineering

Traditional IR Deals with

heterogeneous documents for what concerns: Linguistic choices Syntax Semantics

We just live with that differences

IR applied to SE We have sets of

homogeneous documents for what concerns Syntax, linguistic

choices Examples:

Use cases, test documents, design documents follow a common template and contain recurrent words

Page 5: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Test case Change the date for a visit:

C51 Version: 0 02 000

Use case Satisfies the request to modify a visit for a patient

UcModVis

Priority High

....

Test description

Input Select a visit:

26/09/2003 11:00 First visit

Change: 03/10/2003 11:00

Oracle Invalid sequence: The system does not allow to change a booking

Coverage Valid classes: CE1 CE8 CE14 CE19 CE21

Invalid classes: None

Test case Change the date for a visit:

C51 Version: 0 02 000

Use case Satisfies the request to modify a visit for a patient

UcModVis

Priority High

....

Test description

Input Select a visit:

26/09/2003 11:00 First visit

Change: 03/10/2003 11:00

Oracle Invalid sequence: The system does not allow to change a booking

Coverage Valid classes: CE1 CE8 CE14 CE19 CE21

Invalid classes: None

Problem

Different kinds of software artifacts require specific preprocessing

Page 6: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Test case Change the date for a visit:

C51 Version: 0 02 000

Use case Satisfies the request to modify a visit for a patient

UcModVis

Priority High

....

Test description

Input Select a visit:

26/09/2003 11:00 First visit

Change: 03/10/2003 11:00

Oracle Invalid sequence: The system does not allow to change a booking

Coverage Valid classes: CE1 CE8 CE14 CE19 CE21

Invalid classes: None

Test case Change the date for a visit:

C51 Version: 0 02 000

Use case Satisfies the request to modify a visit for a patient

UcModVis

Priority High

....

Test description

Input Select a visit:

26/09/2003 11:00 First visit

Change: 03/10/2003 11:00

Oracle Invalid sequence: The system does not allow to change a booking

Coverage Valid classes: CE1 CE8 CE14 CE19 CE21

Invalid classes: None

Problem

Different kinds of software artifacts require specific preprocessing

Artifact-specif ic words do not bring useful

information

Page 7: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

A similar problem: image processing

Page 8: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Pixels with peaks of low color intensity

Noise

Noisy images

Pixels with peaks of high color intensity

Page 9: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Mean filter

Reducing noise using smoothing filters

∑∈

=Smnf

mnfM

yxg),(

),(1

),(

Page 10: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Image vs. traceability noise

Image noise: Pixels with high or

low color intensity Pixels are position

dependent

Traceability noise: Terms and linguistic

patterns occurring in many artifacts of a given category Use cases, test

cases.. Artifacts (columns) are

position independentd1 d2 d2 d1

Page 11: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

1 2 3 1 2 3

1,1 1,2 1,3 1,1

2,1 2,2 2,3 2,2

,1 ,2 ,3

s s s s t t t t

k z

k

n n nn

v v v vword

v v v vword

v v vword

L L

L

L

M O M O MM O

L L L

1,1 1,2 1,3 1,

2,1 2,2 2,3 2,

, ,1 ,2 ,3 ,

z

k z

n k n n n n z

v v v v

v v v v

v v v v v

L

L

M M O M O M MO

L L L

Source Documents Target Documents

Linguistic information strictly belonging to source documents

Linguistic information strictly belonging to target documents

Common Information for Source Documents

Common Information For target documents

Representing the noise

Page 12: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

1 2 3 1 2 3

1,1 1,2 1,3 1,1

2,1 2,2 2,3 2,2

,1 ,2 ,3

s s s s t t t t

k z

k

n n nn

v v v vword

v v v vword

v v vword

L L

L

L

M O M O MM O

L L L

1,1 1,2 1,3 1,

2,1 2,2 2,3 2,

, ,1 ,2 ,3 ,

z

k z

n k n n n n z

v v v v

v v v v

v v v v v

L

L

M M O M O M MO

L L L

Mean source vector Mean target vector

1,1

2,1

,1

1

1

1

k

jj

k

jj

k

n jj

vk

vkS

vk

=

=

=

=

∑M

1,1

2,1

,1

1

1

1

m

jj k

m

jj k

m

n jj k

vz

vzT

vz

= +

= +

= +

=

∑M

Representing the noise

Source Documents Target Documents

Common Information for Source Documents

Common Information For target documents

The Mean Vectors are like the continuous component of a signal…The Mean Vectors are like the continuous component of a signal…

S= T=

Page 13: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

1 2 3 1 2 3

1,1 1,2 1,3 1,1

2,1 2,2 2,3 2,2

,1 ,2 ,3

s s s s t t t t

k z

k

n n nn

v v v vword

v v v vword

v v vword

L L

L

L

M O M O MM O

L L L

1,1 1,2 1,3 1,

2,1 2,2 2,3 2,

, ,1 ,2 ,3 ,

z

k z

n k n n n n z

v v v v

v v v v

v v v v v

L

L

M M O M O M MO

L L L

S(mean target

vector)

T(mean target vector)

Representing the noise

Source Documents Target Documents

FilteredSource SetFiltered

Source Set

-

FilteredTarget SetFiltered

Target Set

-

Page 14: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Empirical Study

Goal: analyze the effect of smoothing filter Purpose: investigating how the filter affects

traceability recovery Quality focus: traceability recovery performance

(precision and recall) Perspective:

Researchers: evaluating the novel technique Project managers: adopt a better traceability recovery

technique Context: artifacts from two systems

EasyClinic and Pine

Page 15: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Context

EasyClinic Pine

Description Medical doctor office management

Text-based email client

Language Java CFiles/Classes

37 31

KLOC 20 130Documents 113 100Language Italian EnglishArtifacts Use cases

Interaction diagramsSource codeTest cases

RequirementsUse cases

Page 16: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Research Questions and Factors

RQ1: Does the smoothing filter improve the recovery performances of VSM-based traceability recovery?

RQ2: Does the smoothing filter improve the recovery performances of LSI-based traceability recovery?

RQ3: How do the performances vary for different types of artifacts?

Factors: Use of filter: YES, NO Technique: VSM, LSI Artifact: Req., UC, Int. Diagrams, Code, TC System: Easyclinic, Pine

Page 17: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Analysis Method

We statistically compare the # of false positives of different methods for each correct link identified Wilcoxon Rank Sum test Cliff’s delta effect size

correct

retrievedcorrectrecall

∩=

retrieved

retrievedcorrectprecision

∩=

Performances evaluated by precision and recall:

2

2

0

3

M1 M2

Page 18: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Results

Page 19: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

EasyClinic: Use cases into source (VSM)

Recall

Prec

isio

n

Filtered

Not Filtered

Page 20: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

EasyClinic: Use cases into source (LSI)

Filtered

Not Filtered

Prec

isio

n

Recall

Page 21: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

EasyClinic: Test cases into source (LSI)

Filtered Not Filtered

Prec

isio

n

Recall

Test cases are: Short documents Limited vocabulary Mostly consistent with

source code

Page 22: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Pine: Use cases into requirements (LSI)

Filtered

Not Filtered

Prec

isio

n

Recall

Page 23: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Statistical Comparison

Data set TracedArtifacts

VSM LSI

p-value Effect size p-value Effect size

EasyClinic UC→Code <0.01 0.50(large)

<0.01 0.50(large)

Int. Diag. → Code

<0.01 0.52(large)

<0.01 0.34(medium)

Pine TC → Code 1.00 -(negligible)

1.00 -(negligible)

Req. → UC <0.01 0.58(large)

<0.01 0.58(large)

Page 24: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Link precision improvement

Login Patientvs. PersonPoor vocabulary overlap (10%)

Login Patientvs. PersonPoor vocabulary overlap (10%)

Page 25: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Threats to validity

Construct validity Mainly related to our oracle Provided by developers and for EasyClinic also peer-

reviewed Internal validity

Improvements could be due to other reasons… However we compared different techniques (VSM, LSI) The approach works well regardless of stop word

removal/stemming and use of tf-idf Conclusion validity

Conclusions based on proper (non-parametric) statistics External validity

We considered systems with different characteristics and artifacts

… but further studies are desirable

Page 26: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Conclusions

We proposed the use of smoothing filter to improve performances of IR-based traceability recovery

Idea inspired from digital signal processing The filter significantly improves IR-based

traceability recovery based on VSM (RQ1) and LSI (RQ2)

Filter particularly suitable for artifacts having a higher verbosity (RQ3) e.g., requirements and use cases

Less useful for artifacts composed of short sentences and using a limited vocabulary e.g., test cases

Page 27: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

Work-in-progress

Study replication Different systems and artifacts Use of relevance feedback

More sophisticated smoothing technique Non linear filters

Use in other applications of IR to software engineering E.g. impact analysis or feature location

Page 28: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters

That’s all folks!