Duplicate code detection using anti-unification

Peter Bulychev, Moscow State University

Marius Minea, Institute eAustria, Timisoara

Outline

  The code duplication problem
  Our anti-unification based algorithm
  Comparison with existing methods
  Clone Digger, a tool for finding software clones

What is a software clone?

Two fragments of code form a clone if they are similar enough (according to a given measure of similarity).

    for(int i=0; i<5; i++)
        for(j=0; j<=i; j++)
            cout << i+j;

    for(int k=0; k<6; k++)
        for(m=0; m<=k; m++)
            cout << k+m;

Why is it important to detect code clones?

5% to 20% of the code in software systems consists of clones [1].

Why do programmers produce clones? [2]
  Development strategy
  Maintenance benefits
  Overcoming underlying limitations
  Cloning by accident

Why is the presence of code clones bad?
  Errors in the original must be fixed in every clone.

[1] I.D. Baxter et al. Clone Detection Using Abstract Syntax Trees, 1998.
[2] C.K. Roy and J.R. Cordy. A Survey on Software Clone Detection Research, 2007.

Our clone definition

Different clone definitions can be classified according to their level of granularity:
  List of strings
  Sequence of tokens
  Abstract syntax trees (AST)
  Semantic information

We work at the AST level. We consider two sequences of statements a clone if one of them can be obtained from the other by replacing some subtrees.

Example

    x = a;
    y = f(x,i);
    cout << y;

    x = a + b;
    y = f(x,j);
    cout << y;

[Figure: the ASTs of the two fragments. They differ only in the subtrees a vs. a + b and i vs. j.]

Sketch of the algorithm

  1. Partition similar statements into clusters.
  2. Find pairs of identical cluster sequences.
  3. Refine by examining the identified code sequences for structural similarity.

[Figure: the example statements i=0, i++, f(i) and k=0, k++, f(k).]

Main problems

How to compute the similarity between two trees?
  Use the edit distance.

How to compute the similarity between a new tree and an existing cluster of trees?
  Comparing with each tree in the cluster is expensive.
  Compare the new tree with an average value stored for the cluster.

Anti-unification

The anti-unifier of two trees is the most specific generalization that matches both.

[Figure: the trees f(x + y, x * 2) and f(x + z, x / 2) and their anti-unifier f(x + ?, ?), where each ? is a placeholder.]

Anti-unification features

The anti-unifier of a set of trees keeps their common features: tree structure and common labels.

Anti-unification can be used to compute the edit distance between two trees E1 and E2. Let E0 be their anti-unifier, and let θ1 and θ2 be substitutions such that E0 θ1 = E1 and E0 θ2 = E2. Then

  distance = |θ1| + |θ2|
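The definition above can be made concrete with a small sketch. It is not Clone Digger's code: the tuple encoding of trees, the names Var, size, anti_unify and distance, and the choice to measure a substitution by the total number of nodes it substitutes are all assumptions made for illustration.

```python
from itertools import count


class Var:
    """A placeholder ("?") in a generalized tree (hypothetical helper)."""

    def __init__(self, index):
        self.index = index

    def __repr__(self):
        return f"?{self.index}"


def size(tree):
    """Number of nodes in a tree; a leaf counts as one node."""
    if isinstance(tree, tuple):
        return 1 + sum(size(child) for child in tree[1:])
    return 1


def anti_unify(t1, t2):
    """Return (E0, theta1, theta2) such that E0*theta1 == t1 and E0*theta2 == t2.

    Trees are tuples (label, child, ...); leaves are plain strings.
    """
    fresh = count()
    pairs = {}  # (subtree of t1, subtree of t2) -> Var, so equal mismatches share a placeholder

    def walk(a, b):
        if a == b:
            return a
        if (isinstance(a, tuple) and isinstance(b, tuple)
                and a[0] == b[0] and len(a) == len(b)):
            # Same label and arity: keep the node, generalize the children.
            return (a[0],) + tuple(walk(x, y) for x, y in zip(a[1:], b[1:]))
        if (a, b) not in pairs:
            pairs[(a, b)] = Var(next(fresh))
        return pairs[(a, b)]

    pattern = walk(t1, t2)
    theta1 = {var: a for (a, _), var in pairs.items()}
    theta2 = {var: b for (_, b), var in pairs.items()}
    return pattern, theta1, theta2


def distance(t1, t2):
    """|theta1| + |theta2|, assuming |theta| = total size of the substituted subtrees."""
    _, theta1, theta2 = anti_unify(t1, t2)
    return (sum(size(s) for s in theta1.values())
            + sum(size(s) for s in theta2.values()))


# The two trees from the anti-unification slide: f(x + y, x * 2) and f(x + z, x / 2).
e1 = ("f", ("+", "x", "y"), ("*", "x", "2"))
e2 = ("f", ("+", "x", "z"), ("/", "x", "2"))
pattern, theta1, theta2 = anti_unify(e1, e2)
print(pattern)           # the generalization f(x + ?0, ?1)
print(distance(e1, e2))  # y and z cost 1 node each, x * 2 and x / 2 cost 3 nodes each
```

Under these assumptions the distance of a tree to itself is zero, and it grows with the size of the subtrees that had to be replaced by placeholders.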

The first phase: building clusters of statements

We use a simple one-pass clustering algorithm:

    for tree in statement_trees:
        # choose the cluster that is cheapest to extend with this tree
        best_cluster = min(clusters, key=lambda c: c.add_cost(tree), default=None)
        if best_cluster is not None and best_cluster.add_cost(tree) < threshold:
            best_cluster.append(tree)
        else:
            clusters.append(Cluster(tree))

Finding the best cluster

What add_cost function should we use? The cost should be high in these cases:
  The cluster is large and joining the new tree changes the cluster's average value significantly.
  The average value of the new cluster is far away from the tree.

  add_cost = n * (|au| - |au'|) + (|tree| - |au'|)

  n: the old size of the cluster
  au: the old anti-unifier of the cluster
  au': the new anti-unifier of the cluster
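A minimal sketch of the clustering pass with this cost function follows. It rests on several assumptions not stated in the slides: trees are encoded as tuples, a placeholder contributes 0 to a tree's size (so |au| counts only the nodes common to all members), and the cluster with the lowest cost is the best one. The names HOLE, size, anti_unify, Cluster and build_clusters are illustrative only.

```python
HOLE = "?"  # placeholder for a subtree that differs between trees


def size(tree):
    """Count non-placeholder nodes (assumption: a placeholder has size 0)."""
    if tree == HOLE:
        return 0
    if isinstance(tree, tuple):
        return 1 + sum(size(child) for child in tree[1:])
    return 1


def anti_unify(a, b):
    """Most specific common pattern of two trees (placeholders are not numbered here)."""
    if a == b:
        return a
    if (isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == b[0] and len(a) == len(b)):
        return (a[0],) + tuple(anti_unify(x, y) for x, y in zip(a[1:], b[1:]))
    return HOLE


class Cluster:
    def __init__(self, tree):
        self.trees = [tree]
        self.au = tree  # anti-unifier of all members, the cluster's "average value"

    def add_cost(self, tree):
        n = len(self.trees)                  # old size of the cluster
        new_au = anti_unify(self.au, tree)   # au' in the formula
        return n * (size(self.au) - size(new_au)) + (size(tree) - size(new_au))

    def append(self, tree):
        self.au = anti_unify(self.au, tree)
        self.trees.append(tree)


def build_clusters(statement_trees, threshold):
    """One pass: join the cheapest cluster if it is cheap enough, else open a new one."""
    clusters = []
    for tree in statement_trees:
        best = min(clusters, key=lambda c: c.add_cost(tree), default=None)
        if best is not None and best.add_cost(tree) < threshold:
            best.append(tree)
        else:
            clusters.append(Cluster(tree))
    return clusters


# Statements i=0, k=0, f(i), f(k), i++ encoded as (label, child, ...) tuples.
statements = [
    ("=", "i", "0"),
    ("=", "k", "0"),
    ("call", "f", "i"),
    ("call", "f", "k"),
    ("++", "i"),
]
clusters = build_clusters(statements, threshold=3)
for cluster in clusters:
    print(cluster.au, len(cluster.trees))
```

Each new tree is compared only with one anti-unifier per cluster rather than with every member, which is the point of storing an average value for the cluster.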

Improving efficiency

To avoid comparing each AST with every other AST, we use hashing: the upper parts of the trees are hashed.

[Figure: the ASTs of x[0] = a + b and x[0] = a + (b + c), which share the same upper part.]
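One way to picture the hashing step is to truncate each tree at a fixed depth and use the truncated shape as a dictionary key, so that only trees in the same bucket are compared. The sketch below assumes a tuple encoding of trees and a hypothetical depth parameter; the slides do not say how much of the tree is hashed.

```python
from collections import defaultdict


def upper_part(tree, depth):
    """Keep the labels of the top `depth` levels; everything below becomes "..."."""
    if depth == 0:
        return "..."
    if isinstance(tree, tuple):
        return (tree[0],) + tuple(upper_part(child, depth - 1) for child in tree[1:])
    return tree


def bucket_by_upper_part(trees, depth):
    """Group trees whose upper parts are identical (the dict hashes the tuple key)."""
    buckets = defaultdict(list)
    for tree in trees:
        buckets[upper_part(tree, depth)].append(tree)
    return buckets


# x[0] = a + b and x[0] = a + (b + c) differ only below the second level.
t1 = ("=", ("[]", "x", "0"), ("+", "a", "b"))
t2 = ("=", ("[]", "x", "0"), ("+", "a", ("+", "b", "c")))
t3 = ("call", "f", "x")

buckets = bucket_by_upper_part([t1, t2, t3], depth=2)
for key, members in buckets.items():
    print(key, len(members))
```

With depth=2 the two assignments land in the same bucket and the call in another, so the assignment trees are never compared with the call.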

Why is this not enough?

If we only consider pairs of statements from the same cluster individually, we miss similar sequences of statements. We should find all pairs of identical cluster sequences and then check them for similarity.

    void f() {              // cluster 1
        cin >> i;           // cluster 2
        int j = i * 100;    // cluster 3
        cout << i << j;     // cluster 4
    }

    void f(int j) {         // cluster 5
        cin >> i;           // cluster 2
        int j = i * 100;    // cluster 3
        cout << j;          // cluster 6
    }

The second phase: finding all common subsequences

After the first phase, each statement node is marked with the ID of its cluster. We want to find all pairs of similar sequences of cluster IDs.

We do this using suffix trees. Only long common subsequences are considered.
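To illustrate what this phase computes, the sketch below finds maximal identical runs of cluster IDs in two sequences. It deliberately swaps the suffix tree for a plain dynamic-programming table, which is simpler but slower, and the function name common_runs and the min_len parameter are assumptions made for this example.

```python
def common_runs(seq_a, seq_b, min_len):
    """Return maximal identical runs as (start_a, start_b, length) triples.

    Dynamic programming over run lengths, O(len(seq_a) * len(seq_b)) time;
    this is not the suffix-tree method, only a stand-in for it.
    """
    runs = []
    # prev[j] = length of the identical run ending at seq_a[i - 2], seq_b[j - 1]
    prev = [0] * (len(seq_b) + 1)
    for i in range(1, len(seq_a) + 1):
        cur = [0] * (len(seq_b) + 1)
        for j in range(1, len(seq_b) + 1):
            if seq_a[i - 1] == seq_b[j - 1]:
                cur[j] = prev[j - 1] + 1
                # The run is maximal if it cannot be extended by one more element.
                at_end = i == len(seq_a) or j == len(seq_b)
                if (at_end or seq_a[i] != seq_b[j]) and cur[j] >= min_len:
                    runs.append((i - cur[j], j - cur[j], cur[j]))
        prev = cur
    return runs


# Cluster IDs of the statements in two functions that share clusters 2 and 3.
first = [1, 2, 3, 4]
second = [5, 2, 3, 6]
print(common_runs(first, second, min_len=2))
```

Raising min_len corresponds to considering only long common subsequences: with min_len=3 the shared run of length 2 is no longer reported.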

The third phase: finding similar sequences of statements

[Figure: the statement sequences i=0; k=3; f(i,k) and k=0; n=3; f(k,n) are compared for structural similarity.]

Comparison with existing AST methods

W. Yang, 1991
  Edit distance between two trees
I. Baxter et al., 1998
  Hash functions on subtrees, some kind of edit distance
V. Wahler, 2004
  Comparison of feature vectors
S. Evans et al., 2007
  Subtree patterns (similar to anti-unification), hash functions on subtrees

Clone Digger

The tool is written in Python.

Supported languages:
  Python (ASTs are built using the standard package "compiler")
  Java 1.5 (parser generator ANTLR)

Information on the clones found is written to HTML, with the differences highlighted.

Its application to the open-source projects NLTK and BioPython showed that 12% of their code consists of clones.

Clone Digger

Clone Digger is provided under the GPL license and can be downloaded from http://clonedigger.sourceforge.net

Thank you!