2IS55 Software Evolution
Code duplication
Alexander Serebrenik

Assignments: Reminder
• Assignment 2
  • Architecture Reconstruction
  • Deadline: Today!
• Assignment 3
  • Individual
  • Code duplication
  • Replication study of a scientific paper
/ SET / W&I PAGE 1 27-3-2012

Sources
“Clone detection” Rainer Koschke http://www.informatik.uni-bremen.de/st/lehre/re09/softwareklone.pdf

Where are we now?
• Last week:
  • Code cloning, code duplication, redundancy…
  • Type 1, 2, 3, 4 clones (more refined classification possible)
  • Useful: reliability, reduced time, code preservation
  • Harmful: more interrelated code, more bugs
  • Ignore, eliminate, prevent, manage
  • Detection mechanisms
    − Text-based
    − Metrics-based
    − Token-based

Today
• Clone detection techniques
  • AST-based
    − [Baxter 1996]
    − AST+Tokens combined [Koschke et al. 2006]
  • Program Dependence Graph
    − [Krinke 2001]
• Comparison of different techniques

AST-based clone detection [Baxter 1996]
• If we have a tokenizer we might also have a parser!
• Applicability: the program should be parseable
[Figure: Code → AST → AST with identified clones]
• Compare every subtree with every other subtree?
  • For an AST of n nodes: O(n^3)
• Similarly to text: partitioning with a hash function
  • Works for Type 1 clones
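
The hash-partitioning idea can be sketched as follows. This is a minimal illustration, not the tool from [Baxter 1996]: the tree encoding `(label, [children])`, the function names and the `min_size` threshold are assumptions made for the example. Subtrees are placed into buckets by a structural key, so only subtrees in the same bucket need to be compared.

```python
from collections import defaultdict

def subtree_key(node):
    """Hashable key that encodes the whole subtree (label plus children keys)."""
    label, children = node
    return (label, tuple(subtree_key(child) for child in children))

def size(node):
    """Number of nodes in the subtree."""
    return 1 + sum(size(child) for child in node[1])

def find_exact_clones(root, min_size=3):
    """Group identical subtrees with at least min_size nodes (Type 1 on the AST level)."""
    buckets = defaultdict(list)
    stack = [root]
    while stack:
        node = stack.pop()
        if size(node) >= min_size:
            buckets[subtree_key(node)].append(node)
        stack.extend(node[1])
    # Only buckets with more than one member contain clones.
    return [group for group in buckets.values() if len(group) > 1]

def make_assignment():
    # Hypothetical AST for "x = a + b"
    return ("=", [("x", []), ("+", [("a", []), ("b", [])])])

program = ("block", [make_assignment(), ("call", [("f", [])]), make_assignment()])
clones = find_exact_clones(program, min_size=4)
```

Recomputing `size` and the key for every node keeps the sketch short but is quadratic in the worst case; a realistic implementation would compute both bottom-up once.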

AST-based detection
• Type 2
  • Either take a bad hash function ignoring small subtrees, e.g., names
  • Or replace identity by similarity
• Type 3
  • Sequences of subtrees
  • Go from Type 2-cloned subtrees to their parents
• Rather precise but still slow

$$\mathrm{Similarity}(T_1, T_2) = \frac{2 \cdot \mathrm{Same}(T_1, T_2)}{2 \cdot \mathrm{Same}(T_1, T_2) + \mathrm{Difference}(T_1, T_2)}$$
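
A small sketch of the similarity measure, under explicit assumptions: trees are encoded as `(label, [children])`, Same is counted by a simple top-down positional comparison, and Difference is the number of remaining nodes in both trees. The slide does not fix how Same and Difference are counted, so these counting rules are illustrative only.

```python
def size(node):
    """Number of nodes in the subtree."""
    return 1 + sum(size(child) for child in node[1])

def same(t1, t2):
    """Shared nodes, compared top-down and position by position (an assumption)."""
    if t1[0] != t2[0]:
        return 0
    return 1 + sum(same(a, b) for a, b in zip(t1[1], t2[1]))

def similarity(t1, t2):
    """2*Same / (2*Same + Difference), with Difference = unshared nodes of both trees."""
    shared = same(t1, t2)
    difference = (size(t1) - shared) + (size(t2) - shared)
    return 2 * shared / (2 * shared + difference)

# "x = a + b" versus "y = a + b": only the assigned name differs.
t1 = ("=", [("x", []), ("+", [("a", []), ("b", [])])])
t2 = ("=", [("y", []), ("+", [("a", []), ("b", [])])])
score = similarity(t1, t2)  # 4 shared nodes, 2 differing nodes
```

With a threshold on `score`, renamed identifiers no longer prevent a match, which is what moving from identity to similarity buys for Type 2 clones.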

Recapitulation from last week
• [Baker 1995]
  • Token-based
  • Very fast:
    − 1.1 MLOC, minimal clone size: 30 LOC
    − 7 minutes on SGI IRIX 4.1, 40 MHz, 256 MB
• [Baxter 1996]
  • AST-based
  • Precise but slow
• Idea: Combine the two! [Koschke et al. 2006]
  • In fact they do not use [Baker 1995] but a different token-based approach

AST + Tokens [Koschke et al. 2006]
[Figure: Code → AST → Serialized AST → Token clones]
Example: if q then z = k; else bar; end if;
AST: if(cond: id, then: =(lhs: id, rhs: id), else: call(target: id))
Preorder serialization:   if  id  =   id  id  call  id
Number of descendants:    6   0   2   0   0   1     0
Incomplete syntactical unit:
• Undesirable as a clone
• Identification?
Solution
• Record the number of descendants
• Complete unit: node with all its descendants
Result: AST + Tokens is reasonably fast (faster than pure AST)
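
The serialization step can be sketched like this. It is a minimal illustration under assumptions: the tree encoding `(label, [children])` and the helper names are invented for the example and are not taken from [Koschke et al. 2006]. Each node is emitted in preorder together with its number of descendants, which is enough to decide whether a token range covers a complete syntactic unit.

```python
def serialize(node, out=None):
    """Return a preorder list of (label, number_of_descendants) pairs."""
    if out is None:
        out = []
    label, children = node
    index = len(out)
    out.append(None)  # placeholder, filled in once the subtree has been emitted
    for child in children:
        serialize(child, out)
    out[index] = (label, len(out) - index - 1)
    return out

def is_complete_unit(serial, start, end):
    """True if serial[start:end] is exactly one node plus all its descendants."""
    return end - start == serial[start][1] + 1

# AST of: if q then z = k; else bar; end if;
tree = ("if", [("id", []),
               ("=", [("id", []), ("id", [])]),
               ("call", [("id", [])])])
serial = serialize(tree)
counts = [descendants for _, descendants in serial]
```

A candidate token clone that is not a complete unit, such as the range covering `id = id` starting at the condition, can then be rejected or trimmed.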

Next step
• AST is a tree is a graph
• There are also other graph representations
  • Object Flow Graph (weeks 3 and 4)
  • UML class/package/… diagrams
  • Program Dependence Graph
• These representations do not depend on textual order
  • { x = 5; y = 7; } vs. { y = 7; x = 5; }
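
To see why a dependence-based representation is insensitive to textual order, consider this small sketch. It only handles straight-line code, and the statement encoding `(text, defined_variable, used_variables)` is an assumption made for the example; a real PDG also has control dependences and many more vertex kinds.

```python
def data_dependences(statements):
    """Return the set of (producer, consumer) edges for straight-line code."""
    last_definition = {}
    edges = set()
    for text, defined, used in statements:
        for variable in used:
            if variable in last_definition:
                edges.add((last_definition[variable], text))
        last_definition[defined] = text
    return edges

block_a = [("x = 5", "x", []), ("y = 7", "y", [])]
block_b = [("y = 7", "y", []), ("x = 5", "x", [])]
chained = [("y = b + c", "y", ["b", "c"]), ("x = y + z", "x", ["y", "z"])]

same_graph = data_dependences(block_a) == data_dependences(block_b)
chained_edges = data_dependences(chained)
```

The two reordered blocks yield the same (empty) edge set, whereas the chained example has one edge because `x = y + z` reads the `y` produced by the first statement.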

[Krinke 2001] PDG based
• Vertices:
  • entry points, in- and output parameters
  • assignments, control statements, function calls
  • variables, operators
• Edges:
  • immediate dependencies
    − target has to be evaluated before the source
Example: y = b + c; x = y + z;
[Figure: PDG of the example with the vertices compound, assign (twice), ref. x, ref. y (twice), ref. b, ref. c, ref. z, operator + (twice)]

[Krinke 2001] PDG based
• Vertices:
  • entry points, in- and output parameters
  • assignments, control statements, function calls
  • variables, operators
• Edges:
  • immediate dependencies
  • value dependencies
  • reference dependencies
  • data dependencies
  • control dependencies
    − Not in this example
Example: y = b + c; x = y + z;
[Figure: PDG of the example, same vertices as on the previous slide]

Identification of similar subgraphs – Theory
• Start with vertices 1 and 10
• Partition the incident edges based on their labels
• Select classes present in both graphs
• Add the target vertices to the set of reached vertices
• Repeat the process
• “Maximal similar subgraphs”

Identification of similar subgraphs – Practice
• The kinds of edges serve as labels
• We also need to compare labels of vertices
• We should stop after k iterations
  • Higher k ⇒ higher recall
  • Higher k ⇒ higher execution time
  • Experiment: k = 20
[Figure: PDG of the example y = b + c; x = y + z;]
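
The growth procedure can be approximated by the sketch below. It is a strong simplification and not Krinke's algorithm as published: both frontiers advance in lockstep, edge classes are keyed by (edge label, target vertex label), and the graph encoding `vertex -> (vertex_label, [(edge_label, target)])` as well as all names are assumptions for this example.

```python
from collections import defaultdict

def partition(graph, frontier):
    """Group targets of outgoing edges by (edge label, label of the target vertex)."""
    classes = defaultdict(set)
    for vertex in frontier:
        for edge_label, target in graph[vertex][1]:
            classes[(edge_label, graph[target][0])].add(target)
    return classes

def similar_subgraphs(g1, g2, start1, start2, k):
    """Grow two vertex sets from the start vertices for at most k iterations."""
    reached1, reached2 = {start1}, {start2}
    front1, front2 = {start1}, {start2}
    for _ in range(k):
        classes1, classes2 = partition(g1, front1), partition(g2, front2)
        common = classes1.keys() & classes2.keys()  # classes present in both graphs
        front1 = set().union(*(classes1[c] for c in common)) - reached1
        front2 = set().union(*(classes2[c] for c in common)) - reached2
        if not front1 or not front2:
            break
        reached1 |= front1
        reached2 |= front2
    return reached1, reached2

# Two tiny hypothetical graphs that share an assign -> operator structure.
g1 = {1: ("assign", [("value", 2), ("reference", 3)]),
      2: ("operator +", []), 3: ("ref", [])}
g2 = {10: ("assign", [("value", 11), ("control", 12)]),
      11: ("operator +", []), 12: ("call", [])}
result = similar_subgraphs(g1, g2, 1, 10, k=20)
```

The bound `k` caps the number of growth steps, which is where the trade-off between recall and execution time mentioned on the slide comes from.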

Choosing your tools: Precision / Recall
• Quality depends on scenario [Type 1, Type 2, Type 3]
• [Roy et al. 2009]: 6 is the maximal grade, 0 the minimal

Tool        Technique           Category  S1   S2   S3
Duploc      Ducasse             Text      4    0    2.8
-           Marcus and Maletic  Text      2.6  1.8  1.6
Dup         Baker               Token     4    2.8  0
CCFinder    Kamiya              Token     5    3.8  0.8
CloneDr     Baxter              AST       6    4.3  3.8
cpdetector  Koschke             AST       6    3.8  0
-           Mayrand             Metrics   3.3  4.8  3.4
Duplix      Krinke              Graph     5    4.8  4
More tools: ConQAT, DECKARD, Dude, Simian

Which technique/tool is the best one?
• Quality
  • Precision
  • Recall
• Usage
  • Availability
  • Dependence on a platform
  • Dependence on an external component (lexer, tokenizer, …)
  • Input/output format
• Programming language
• Clones
  • Granularity
  • Types
  • Pairs vs. groups
• Technique
  • Normalization
  • Storage
  • Worst-case complexity
  • Pre-/postprocessing
• Validation
• Extra: metrics

Clone detection techniques: Summary
• Many different techniques
  • Text, metrics, tokens, AST, program dependence graph, combinations
• Techniques are often supported by tools
• Precision depends on what kind of clones we need:
  • Type 1, Type 2, Type 3, Type 4
• Extra conditions
  • Programming language, presence of external tools, platforms, extras (metrics), normalization, ...
2IS55 Software Evolution
Repository mining
Alexander Serebrenik

It is all about communication…
Test #14352 fails sometimes
The error should be somewhere here… What does this code do?
I know how to fix it!

Tools record information
Software repositories

How can the repositories serve you?
• Is the documentation up-to-date?
• How fast are the bugs resolved?
• Who is responsible for
  • Bugs
  • Overly complex code
  • Code guidelines violations?
• What parts are covered by tests?

Software repositories
• Mail archives
• Version control systems
  • CVS, Subversion, Git, Mercurial, …
• Bug trackers
  • Bugzilla, JIRA
• Developer networks
  • StackOverflow
• Combined
  • SourceForge, GitHub

Mail archives
• Record communication between the developers
• Structure of the community
[Figure] Tang et al. 2009: GTK+ mailing list participants in Western Europe
[Figure] Bird et al. 2006: “Centrality” correlates with activity

Bugzilla
• How are bugs being resolved?
• How many persons pass a bug around before it is resolved?
• Do larger files contain more bugs/LOC than smaller ones?
• Do code clones contain more bugs or propagate more bugs?
• …and many more…

Version control in a nutshell
• System used to reliably reproduce a specific revision of software over time
• Terms:
  • mainline or trunk
  • branch
  • tag
  • merge
[Figure: edit-update cycle, http://svn.software-carpentry.org/swc/3.0/version/edit-update-cycle.png]

Version control examples: CVS
• Information recorded per file
  • who, when, changes, branch, tags, message

Version control examples: Subversion (SVN)
• Information recorded per commit
  • who, when, files, changes, branch, tags, message
• What are advantages/disadvantages of recording information per file and per commit?

So far: Centralized VCS
• Local repositories
• Better for subproject delegation
• Fast
• No need to register
Images by Stijn Hoop

Distributed version control example: Git
• Distinguishes between different kinds of humans
  • authors, committers, signed-off-by, …
• Many more branches than in centralized VCS
  • include in the analysis
• Popular projects: Linux Kernel, Android, …

Version control systems
• Centralized vs. distributed
• File versioning (CVS) vs. product versioning
• Record at least
  • File name, file/product version, time stamp, committer
  • Commit message
  • Changed lines ⇒ Program differencing

Program differencing: Remote cousin of cloning
• Formally
  • Input: Two programs
  • Output:
    − Differences between the two programs
    − Unchanged code fragments in the old version and their corresponding locations in the new
• Similar to clone detection
  • Comparison of lines, tokens, trees and graphs

Diff: Longest common subsequence
• Program: sequence of lines
• Object of comparison: line
• Comparison:
  • 1:1
  • lines are identical
  • matched pairs cannot overlap
• Technique: longest common subsequence
  • Minimal number of addition/deletion steps
  • Dynamic programming

Longest common subsequence
• Programs X (n lines), Y (m lines)
• Data structure C[0..n,0..m]
• Init: C[r,0]=0, C[0,c]=0 for any r and c

X:
p0  mA (){
p1  if (pred_a) {
p2  foo()
p3  }
p4  }

Y:
c0  mA (){
c1  if (pred_a0) {
c2  if (pred_a) {
c3  foo()
c4  }
c5  }
c6  }

C        c0  c1  c2  c3  c4  c5  c6
     0   0   0   0   0   0   0   0
p0   0
p1   0
p2   0
p3   0
p4   0

Longest common subsequence
• For every r and every c
  • If X[r]=Y[c] then C[r,c]=C[r-1,c-1]+1
  • Else C[r,c]=max(C[r,c-1],C[r-1,c])

X: p0 mA (){ · p1 if (pred_a) { · p2 foo() · p3 } · p4 }
Y: c0 mA (){ · c1 if (pred_a0) { · c2 if (pred_a) { · c3 foo() · c4 } · c5 } · c6 }

C        c0  c1  c2  c3  c4  c5  c6
     0   0   0   0   0   0   0   0
p0   0   1   1   1   1   1   1   1
p1   0   1   1   2   2   2   2   2
p2   0   1   1   2   3   3   3   3
p3   0   1   1   2   3   4   4   4
p4   0   1   1   2   3   4   5   5
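
The table-filling rule translates almost directly into code. The following is a minimal sketch; the function name `lcs_table` is chosen for the example, and Python lists are 0-based, so line X[r] of the slide is `x[r - 1]` here.

```python
def lcs_table(x, y):
    """Fill C[0..n][0..m]; C[r][c] is the LCS length of the first r lines of x
    and the first c lines of y."""
    n, m = len(x), len(y)
    table = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 and column 0 stay 0
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            if x[r - 1] == y[c - 1]:
                table[r][c] = table[r - 1][c - 1] + 1
            else:
                table[r][c] = max(table[r][c - 1], table[r - 1][c])
    return table

X = ["mA (){", "if (pred_a) {", "foo()", "}", "}"]
Y = ["mA (){", "if (pred_a0) {", "if (pred_a) {", "foo()", "}", "}", "}"]
C = lcs_table(X, Y)
lcs_length = C[len(X)][len(Y)]
```

For the example programs the rows of `C` reproduce the table on the slide, and the bottom-right entry gives the length of the longest common subsequence.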

Longest common subsequence
• Start with r=n and c=m
• backTrace(r,c)
  • If r=0 or c=0 then “”
  • If X[r]=Y[c] then backTrace(r-1,c-1)+X[r]
  • Else
    − If C[r,c-1] > C[r-1,c] then backTrace(r,c-1) else backTrace(r-1,c)

X: p0 mA (){ · p1 if (pred_a) { · p2 foo() · p3 } · p4 }
Y: c0 mA (){ · c1 if (pred_a0) { · c2 if (pred_a) { · c3 foo() · c4 } · c5 } · c6 }

C        c0  c1  c2  c3  c4  c5  c6
     0   0   0   0   0   0   0   0
p0   0   1   1   1   1   1   1   1
p1   0   1   1   2   2   2   2   2
p2   0   1   1   2   3   3   3   3
p3   0   1   1   2   3   4   4   4
p4   0   1   1   2   3   4   5   5
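
The backtrace can be sketched iteratively instead of recursively; the decisions are the same as in backTrace(r,c). The function name `lcs_lines` is an assumption for this example, and the table is recomputed inside the function so that the block is self-contained.

```python
def lcs_lines(x, y):
    """Return one longest common subsequence of the line lists x and y."""
    n, m = len(x), len(y)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            if x[r - 1] == y[c - 1]:
                table[r][c] = table[r - 1][c - 1] + 1
            else:
                table[r][c] = max(table[r][c - 1], table[r - 1][c])

    # Walk back from the bottom-right corner, collecting matched lines.
    r, c = n, m
    common = []
    while r > 0 and c > 0:
        if x[r - 1] == y[c - 1]:
            common.append(x[r - 1])
            r, c = r - 1, c - 1
        elif table[r][c - 1] > table[r - 1][c]:
            c -= 1
        else:
            r -= 1
    common.reverse()  # lines were collected from last to first
    return common

X = ["mA (){", "if (pred_a) {", "foo()", "}", "}"]
Y = ["mA (){", "if (pred_a0) {", "if (pred_a) {", "foo()", "}", "}", "}"]
unchanged = lcs_lines(X, Y)
added = len(Y) - len(unchanged)
```

Lines of Y outside the common subsequence are reported as additions and lines of X outside it as deletions; in the example all of X is unchanged and Y adds two lines.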

Diff: Summarizing
• Comparison:
  • 1:1, identical lines, non-overlapping pairs
• Technique: longest common subsequence
• What kind of code modifications will diff miss?
  • Copy & paste: apple ⇒ applple
    − 1:1 is violated
  • Move: apple ⇒ aplep

More than lines: AST Diff [Yang 1992]
• Construct ASTs for the input programs

X:
p0  mA (){
p1  if (pa) {
p2  foo()
p3  }
p4  }
p5  mB (b) {
p6  a = 1
p7  b = b+1
p8  fun(a,b)
p9  }

Y:
c0   mA (){
c1   if (pa0) {
c2   if (pa) {
c3   foo()
c4   }
c5   }
c6   }
c7   mB (b) {
c8   b = b+1
c9   a = 1
c10  fun(a,b)
c11  }

[Figure: AST of X; root has the subtrees mA and mB, each with a Body; further node labels: if, pa, foo, args, =, +, fun, a, b, 1]

More than lines: AST Diff [Yang 1992]
• Recursive algorithm: pairwise subtree comparison
• n – first level subtrees in X
• m – first level subtrees in Y
• Array: M[0..n, 0..m]
[Figure: ASTs of X and Y side by side; in Y the if node of mA has the children pa0 and a nested if with the children pa and foo]

More than lines: AST Diff [Yang 1992]
[Figure: ASTs of X and Y side by side]

One matrix M per pair of compared nodes (rows: subtrees in X, columns: subtrees in Y):

M      0   mA  mB
0      0   0   0
mA     0
mB     0

M      0   Body
0      0   0
Body   0

M      0   if
0      0   0
if     0

M      0   pa0  if
0      0   0    0
pa     0
foo    0

M[i,j] = max(M[i,j-1], M[i-1,j], M[i-1,j-1]+W[i,j])
W is the recursive call
If root symbols differ return 0

More than lines: AST Diff [Yang 1992]
[Figure: ASTs of X and Y side by side]

M      0   mA  mB
0      0   0   0
mA     0
mB     0

M      0   Body
0      0   0
Body   0

M      0   if
0      0   0
if     0

M      0   pa0  if
0      0   0    0
pa     0   0    0
foo    0   0    0

M[i,j] = max(M[i,j-1], M[i-1,j], M[i-1,j-1]+W[i,j])
W is the recursive call
If root symbols differ return 0
When the computation is finished, return M[n,m]+1
Here: the comparison of the two if nodes returns 1

More than lines: AST Diff [Yang 1992]
[Figure: ASTs of X and Y side by side]

M      0   mA  mB
0      0   0   0
mA     0   3
mB     0

M      0   Body
0      0   0
Body   0   2

M      0   if
0      0   0
if     0   1

M      0   pa0  if
0      0   0    0
pa     0   0    0
foo    0   0    0

M[i,j] = max(M[i,j-1], M[i-1,j], M[i-1,j-1]+W[i,j])
W is the recursive call
If root symbols differ return 0
When the computation is finished, return M[n,m]+1
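
The recurrence can be written as a short recursive function. This is a sketch of the rule stated on the slides, not a reimplementation of the tool from [Yang 1992]; the tree encoding `(label, [children])` and the name `match` are assumptions for the example.

```python
def match(a, b):
    """Number of matched nodes between trees a and b, following
    M[i,j] = max(M[i,j-1], M[i-1,j], M[i-1,j-1] + W[i,j])."""
    label_a, kids_a = a
    label_b, kids_b = b
    if label_a != label_b:
        return 0  # root symbols differ
    n, m = len(kids_a), len(kids_b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match(kids_a[i - 1], kids_b[j - 1])  # W[i,j]: the recursive call
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    return M[n][m] + 1  # +1 for the matching roots

# The mA subtrees of X and Y from the slides.
foo = ("foo", [("args", [])])
x_mA = ("mA", [("Body", [("if", [("pa", []), foo])])])
y_mA = ("mA", [("Body", [("if", [("pa0", []), ("if", [("pa", []), foo])])])])
matched = match(x_mA, y_mA)
```

For the example only mA, Body and the outer if are matched: the inserted `if (pa0)` pushes `pa` and `foo` one level down, so they are compared against `pa0` and the nested `if` and contribute nothing.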

More than lines: AST Diff [Yang 1992]
[Figure: AST of X]

M      0   mA  mB
0      0   0   0
mA     0   3
mB     0

X: p0 mA (){ · p1 if (pa) { · p2 foo() · p3 } · p4 } · p5 mB (b) { · p6 a = 1 · p7 b = b+1 · p8 fun(a,b) · p9 }
Y: c0 mA (){ · c1 if (pa0) { · c2 if (pa) { · c3 foo() · c4 } · c5 } · c6 } · c7 mB (b) { · c8 b = b+1 · c9 a = 1 · c10 fun(a,b) · c11 }

Continuing the process
• Advantages
  • Respect the parent-child relations
  • Ignore the order between siblings
• Disadvantages
  • Sensitive to tree level changes (“if (pa)”)
  • Ignore dependencies such as data flow, etc.
• Can be adapted to OO-specific dataflow (inheritance, exceptions): JDiff

X: p0 mA (){ · p1 if (pa) { · p2 foo() · p3 } · p4 } · p5 mB (b) { · p6 a = 1 · p7 b = b+1 · p8 fun(a,b) · p9 }
Y: c0 mA (){ · c1 if (pa0) { · c2 if (pa) { · c3 foo() · c4 } · c5 } · c6 } · c7 mB (b) { · c8 b = b+1 · c9 a = 1 · c10 fun(a,b) · c11 }

Changes never come alone
• Search and replace
• Check-in comment: “Common methods go in an abstract class. Easier to extend/maintain/fix”
• Change: a rule rather than a set of application results
  • Rules can have exceptions
• Idea [Kim and Notkin 2009]
  • Observe differences between subsequent versions
  • Formalize them as facts
  • Discover rules (à la data mining)
  • Record exceptions

How did the users experience the tool?
• Positive
  • “You can’t infer the intent of a programmer, but this is pretty close.”
  • “This ‘except’ thing is great!”
• Negative
  • “This looks great for big architectural changes, but I wonder what it would give you if you had lots of random changes.”
  • “This will look for relationships that do not exist.”

What about “random” changes?
• eROSE [Zimmermann, Weißgerber, Diehl, Zeller ‘04]
Developers who modified this function also modified…
ROSE keeps a list of association rules

eROSE
• ROSE alerts for incomplete changes
• How? Association rules:
  {(Comp.java, field, fKeys[])} ⇒ { (Comp.java, method, initDefaults()), (plug.properties, file, plug.properties) }
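
How such rules could be derived from version history is sketched below. This is not eROSE's implementation: it mines only single-antecedent rules, and the transactions, the thresholds and the function name `mine_rules` are invented for illustration. Each transaction stands for the set of entities changed together in one commit.

```python
from collections import Counter

def mine_rules(transactions, min_support=2, min_confidence=0.5):
    """Return {(antecedent, consequent): (support, confidence)} for co-changed entities."""
    changed = Counter()       # number of transactions that contain an entity
    co_changed = Counter()    # number of transactions that contain both entities
    for transaction in transactions:
        entities = set(transaction)
        changed.update(entities)
        for a in entities:
            for b in entities:
                if a != b:
                    co_changed[(a, b)] += 1
    rules = {}
    for (a, b), support in co_changed.items():
        confidence = support / changed[a]
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = (support, confidence)
    return rules

# Hypothetical commit history.
history = [
    {"fKeys[]", "initDefaults()", "plug.properties"},
    {"fKeys[]", "initDefaults()"},
    {"fKeys[]", "initDefaults()", "plug.properties"},
    {"other()"},
]
rules = mine_rules(history)
```

With this history, a developer who changes `fKeys[]` without touching `initDefaults()` would trigger a warning, because the rule `fKeys[] ⇒ initDefaults()` held in every recorded transaction.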

Experimental evaluation
• Recall: 0.15
  • suggestions included 15% of all changes that were carried out
• Precision: 0.26
  • 26% of all recommendations were correct
• Likelihood:
  • in 70% of all transactions, the topmost three suggestions contain a changed entity
• eROSE learns quickly
  • within 30 days
• Extensive evaluation
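
For a single transaction, precision and recall of a set of suggestions can be computed as in the sketch below. The entity names are made up, and returning 0.0 for an empty set is a convention chosen here, not something the slides prescribe.

```python
def precision_recall(suggested, actual):
    """Precision: share of suggestions that were really changed.
    Recall: share of real changes that were suggested."""
    suggested, actual = set(suggested), set(actual)
    hits = suggested & actual
    precision = len(hits) / len(suggested) if suggested else 0.0
    recall = len(hits) / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical example: four suggestions, two entities actually changed.
precision, recall = precision_recall(
    suggested={"initDefaults()", "plug.properties", "a()", "b()"},
    actual={"initDefaults()", "c()"},
)
```

Here one of four suggestions is correct and one of two actual changes is covered, so precision is 0.25 and recall is 0.5.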

Conclusions
• Code cloning
• Repositories
  • Version control
    − File/commit-level change, centralized/distributed
  • Mail archives, bug trackers
• Differencing
  • Two approaches to identification of related differences:
    − Both based on data mining/rule learning
    − Interesting ideas, not always impressive results
    − A lot of improvement is possible!