Cross Language Clone Analysis Team 2 October 27, 2010.
![Page 1: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/1.jpg)
Presentation 5
Cross Language Clone Analysis
Team 2
October 27, 2010
![Page 2: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/2.jpg)
• Current Tasks
• GOLD Parsing System
• Grammar Update
• Clone Analysis
• Demonstration
• Team Collaboration
• Path Forward
Agenda
2
![Page 3: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/3.jpg)
• Allen Tucker
• Patricia Bradford
• Greg Rodgers
• Brian Bentley
• Ashley Chafin
Our Team
3
![Page 4: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/4.jpg)
Current Tasks
What we are tackling…
4
![Page 5: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/5.jpg)
Current tasks created for the first user story “Source Code Load & Translate”:
◦ Load & parse C# source code.
◦ Load & parse Java source code.
◦ Load & parse C++ source code.
◦ Translate the parsed C# source code to CodeDOM.
◦ Translate the parsed Java source code to CodeDOM.
◦ Translate the parsed C++ source code to CodeDOM.
◦ Associate the CodeDOM with the original source code.
Current Tasks (Review)
5
![Page 6: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/6.jpg)
UML Model – Load & Parse
6
![Page 7: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/7.jpg)
UML Model – Translate
7
![Page 8: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/8.jpg)
UML Model – Associate
8
![Page 9: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/9.jpg)
GOLD Parsing System
GOLD Parsing Populating CodeDOM
9
![Page 10: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/10.jpg)
Topics To Discuss
• What are we doing?
• Compiled Grammar Table
• Bookkeeping
• Testing
10
![Page 11: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/11.jpg)
How It Works (Block Structure)
[Diagram: Grammar Builder → Compiled Grammar Table (*.cgt) → Engine; Source Code → Engine → Parsed Data]
11
![Page 12: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/12.jpg)
How It Works (Process)
[Diagram: Grammar Builder → Compiled Grammar Table (*.cgt) → Engine; Source Code → Engine → Parsed Data]
Typical output from engine: a long nested tree
12
![Page 13: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/13.jpg)
Usage within CloneDigger
[Diagram: Compiled Grammar Table (*.cgt) and Source Code → Engine → Parsed Data → CodeDOM Conversion → AST]
CodeDOM Conversion
• Need to write a routine to move data from the parsed tree to CodeDOM.
• Parsed data trees from the parser are stored in a consistent data structure, but are based on rules defined within the grammars.
13
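The conversion routine described on this slide can be sketched roughly as follows. This is a minimal illustration, not the team's implementation: `ParseNode`, `ModelNode`, the rule names, and the two mapping tables are all hypothetical, and the real GOLD engine output and CodeDOM types look different. The point it shows is that each grammar needs its own rule-to-model table, while the tree walk itself can be shared.

```python
from dataclasses import dataclass, field


@dataclass
class ParseNode:
    """Hypothetical parse-tree node: a grammar rule name, children, and token text."""
    rule: str
    children: list = field(default_factory=list)
    text: str = ""


@dataclass
class ModelNode:
    """Hypothetical language-neutral node standing in for a CodeDOM element."""
    kind: str
    name: str = ""
    children: list = field(default_factory=list)


# One table per grammar: rule name -> neutral kind. Rule names are illustrative only.
JAVA_RULES = {"ClassDeclaration": "TypeDeclaration", "MethodDeclaration": "MemberMethod"}
CSHARP_RULES = {"Class Decl": "TypeDeclaration", "Method Dec": "MemberMethod"}


def convert(node, rules):
    """Return the list of ModelNodes produced by the subtree rooted at node."""
    converted = []
    for child in node.children:
        converted.extend(convert(child, rules))
    kind = rules.get(node.rule)
    if kind is None:
        # Rule with no model counterpart: hand its converted children to the parent.
        return converted
    # Assumption: the declared name is the first direct "Identifier" child.
    name = next((c.text for c in node.children if c.rule == "Identifier"), "")
    return [ModelNode(kind, name, converted)]
```

With this shape, a Java tree and a C# tree for the same class produce equal model trees even though their rule names differ, which is what makes a later cross-language comparison possible.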
![Page 14: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/14.jpg)
For Java, there are:
◦ 359 production rules
◦ 249 distinct symbols (terminal & non-terminal)
For C#, there are:
◦ 415 production rules
◦ 279 distinct symbols (terminal & non-terminal)
Compiled Grammar Table
14
![Page 15: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/15.jpg)
Production Rule Dependencies
![Page 16: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/16.jpg)
Since there are so many production rules, we came up with the following bookkeeping:
• A spreadsheet of the compiled grammar table (for each language) with each production rule indexed.
◦ This spreadsheet covers:
  - various aspects of the language
  - what we have/have not handled from the parser
  - what we have/have not implemented into CodeDOM
  - percentage complete
Our Grammar Bookkeeping
16
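As a rough illustration of the bookkeeping above, the spreadsheet rows and the percentage-complete figure could be modeled as below. The field names, the sample rule strings, and the definition of "complete" (handled from the parser and implemented into CodeDOM) are assumptions made for this sketch; the slides do not define them.

```python
from dataclasses import dataclass


@dataclass
class RuleStatus:
    """One spreadsheet row for an indexed production rule (hypothetical layout)."""
    index: int
    rule: str
    parsed: bool       # handled from the parser
    in_codedom: bool   # implemented into CodeDOM


def percent_complete(rows):
    """Share of rules that are both parsed and implemented into CodeDOM, in percent."""
    if not rows:
        return 0.0
    done = sum(1 for r in rows if r.parsed and r.in_codedom)
    return 100.0 * done / len(rows)


# Illustrative rows only; these are not entries from the team's actual spreadsheet.
rows = [
    RuleStatus(0, "<Class Decl> ::= ...", parsed=True, in_codedom=True),
    RuleStatus(1, "<Method Dec> ::= ...", parsed=True, in_codedom=False),
    RuleStatus(2, "<If Stm> ::= ...", parsed=False, in_codedom=False),
    RuleStatus(3, "<While Stm> ::= ...", parsed=True, in_codedom=False),
]
```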
![Page 17: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/17.jpg)
Our Grammar Bookkeeping
17
![Page 18: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/18.jpg)
• White Box Testing:
◦ Unit Testing
• Black Box Testing:
◦ Production Rule Testing
  - Allows us to test the robustness of our engine because we can force rule production errors.
• Regression Testing:
◦ Automated
Testing
18
![Page 19: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/19.jpg)
Unit Testing
19
![Page 20: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/20.jpg)
Production Rule Test Input File Example
20
![Page 21: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/21.jpg)
Task Understanding
Three Step Process
• Step 1: Code Translation
• Step 2: Clone Detection
• Step 3: Visualization
[Diagram: Source Files → Translator → Common Model → Inspector → Detected Clones → UI → Clone Visualization]
21
![Page 22: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/22.jpg)
Grammar Updates
Java & C#
22
![Page 23: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/23.jpg)
Grammar Updates
• Currently the grammars we have for the GOLD parser are outdated.
• Current GOLD grammars:
◦ C# version 2.0
◦ Java version 1.4
• Currently available language versions:
◦ C# version 4.0
◦ Java version 6
![Page 24: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/24.jpg)
Grammar Updates
• Available updated grammars:
◦ ANTLR has grammars updated to more recent versions of both C# and Java.
◦ C# version 4.0 (latest version)
◦ Java version 1.5 (second-to-latest version)
• Currently we are attempting to transform the ANTLR grammars into GOLD parser grammars.
![Page 25: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/25.jpg)
Grammar Update Issues
• Grammars for C# and Java are very complex and require a lot of work to build.
• ANTLR and GOLD parser grammars use completely different syntax.
• Positive note: other development is not halted by the use of older grammars.
![Page 26: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/26.jpg)
Clone Analysis
Overview and Dr. Kraft’s Student’s Tool
26
![Page 27: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/27.jpg)
Software Clones: (Definitions from Wikipedia)
◦ Duplicate code: a sequence of source code that occurs more than once, either within a program or across different programs owned or maintained by the same entity.
◦ Clones: sequences of duplicate code.
“Clones are segments of code that are similar according to some definition of similarity.”
—Ira Baxter, 2002
Software Clones
![Page 28: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/28.jpg)
How clones are created:
◦ copy and paste programming
◦ similar functionality, similar code
◦ plagiarism
Software Clones (cont.)
![Page 29: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/29.jpg)
3 Types of Clones:
◦ Type 1: an exact copy without modifications (except for whitespace and comments).
◦ Type 2: a syntactically identical copy; only variable, type, or function identifiers have been changed.
◦ Type 3: a copy with further modifications; statements have been changed, added, or removed.
Software Clones (cont.)
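The three clone types above can be made concrete with a toy classifier. This is only a sketch under stated assumptions: the regular-expression tokenizer, the small `KEYWORDS` set, and the `classify` helper are invented for illustration and handle just enough C-like syntax for the example. It also does not check that identifiers were renamed consistently, and it cannot tell a Type 3 clone from unrelated code.

```python
import re

# Toy keyword list (assumption); a real tool would take this from the grammar.
KEYWORDS = {"int", "return", "if", "else", "for", "while"}

# Comments first, then identifiers, numbers, and any other single character.
TOKEN = re.compile(r"//[^\n]*|/\*.*?\*/|[A-Za-z_]\w*|\d+|\S", re.DOTALL)
IDENT = re.compile(r"[A-Za-z_]\w*")


def tokens(code):
    """Token list with whitespace and comments dropped (Type 1 view)."""
    return [t for t in TOKEN.findall(code) if not t.startswith(("//", "/*"))]


def normalized(code):
    """Token list with every non-keyword identifier replaced by ID (Type 2 view)."""
    return [
        "ID" if IDENT.fullmatch(t) and t not in KEYWORDS else t
        for t in tokens(code)
    ]


def classify(a, b):
    """Classify a pair of fragments by the strictest clone type that matches."""
    if tokens(a) == tokens(b):
        return "Type 1"
    if normalized(a) == normalized(b):
        return "Type 2"
    return "Type 3 or unrelated"
```

For example, two copies of a function that differ only in a comment and line breaks compare as Type 1, while renaming the function and its parameters moves the pair to Type 2.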
![Page 30: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/30.jpg)
• Per our task, in order to find clones across different programming languages, we will first have to convert the code from each language to a language-independent object model.
• Some language-independent object models:
◦ Dagstuhl Middle Metamodel (DMM)
◦ Microsoft CodeDOM
• Both of these models provide a language-independent object model for representing the structure of source code.
Introduction (cont.)
![Page 31: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/31.jpg)
Detecting clones across multiple programming languages is on the cutting edge of research.
• A preliminary version of this was done by Dr. Kraft and his students for C# and VB.
◦ They compared the Mono C# parser (written in C#) to the Mono VB parser (written in VB).
◦ Publication: Nicholas A. Kraft, Brandon W. Bonds, Randy K. Smith: Cross-language Clone Detection. SEKE 2008: 54-59
Related Research
![Page 32: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/32.jpg)
• Token sequence of CodeDOM graphs with Levenshtein distance
◦ The Levenshtein distance between two sequences is defined as the minimum number of edits needed to transform one sequence into the other.
• Performs comparisons of code files
• CodeDOM tree is tokenized
• Based on distances
◦ Percentage of matching tokens in a sequence
Dr. Kraft Approach
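The Levenshtein distance used in this approach can be sketched with the standard dynamic-programming formulation. The `similarity` helper is an assumption: the slide only says that the result is a percentage of matching tokens, so the exact formula used by the tool may differ. The token names in the usage note are likewise illustrative, not actual CodeDOM node types.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or keep when equal
            ))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Percentage similarity derived from the distance (assumed formula)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - levenshtein(a, b)) / longest
```

Because the function works on any sequences, it can compare token lists directly, for example `similarity(["Method", "Param", "Param", "Return"], ["Method", "Param", "Return"])`.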
![Page 33: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/33.jpg)
Dr. Kraft Approach (cont)
![Page 34: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/34.jpg)
Only does file-to-file comparisons
◦ Does not detect clones in same source file
Can only detect Type 1 and some Type 2 clones
Not very efficient (brute force)
Limitations
![Page 35: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/35.jpg)
• Split into parameter (identifiers and literals) and non-parameter tokens
• Non-parameter tokens summarized using a hash function
• Parameter tokens are encoded using a position index for their occurrence in the sequence
◦ Abstracts concrete names and values while maintaining order
Enhancements
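One way to read the encoding above is sketched below. The details are assumptions, since the slide does not spell them out: tokens arrive as `(kind, text)` pairs, each parameter token is replaced by the index of its first occurrence among the parameters seen so far, and each non-parameter token is hashed individually with CRC-32 rather than summarized in runs.

```python
import zlib

PARAMETER_KINDS = ("identifier", "literal")


def encode(tokens):
    """Encode (kind, text) tokens so that concrete names and values are abstracted."""
    first_seen = {}  # (kind, text) -> index of first occurrence among parameters
    encoded = []
    for kind, text in tokens:
        if kind in PARAMETER_KINDS:
            index = first_seen.setdefault((kind, text), len(first_seen))
            encoded.append(("P", index))
        else:
            # Stable hash of the non-parameter token (keyword, operator, punctuation).
            encoded.append(("H", zlib.crc32(text.encode("utf-8"))))
    return encoded
```

Under this scheme `a = a + 1` and `x = x + 2` encode identically, while `a = b + 1` does not, because the position indices preserve which parameters repeat and in what order.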
![Page 36: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/36.jpg)
• Represent all suffixes of the sequence in a suffix tree
• Suffixes that share the same set of edges have a common prefix
◦ Prefix occurs more than once (clone)
Enhancements (cont)
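The idea that a prefix shared by two suffixes is a repeated subsequence can be illustrated without building a full suffix tree. The sketch below swaps in a simpler technique: it sorts all suffixes (a naive suffix array) and compares neighbours, which finds the same longest repeat but is far less efficient than a real suffix tree. The function name and return shape are assumptions made for this example, and overlapping repeats are not filtered out.

```python
def longest_repeat(seq):
    """Return (length, (start_a, start_b)) of the longest subsequence that occurs twice."""
    seq = list(seq)
    # Naive suffix array: suffix start positions ordered by the suffix contents.
    order = sorted(range(len(seq)), key=lambda i: seq[i:])
    best_len, best_pos = 0, ()
    for a, b in zip(order, order[1:]):
        # Length of the common prefix of two neighbouring suffixes.
        n = 0
        while a + n < len(seq) and b + n < len(seq) and seq[a + n] == seq[b + n]:
            n += 1
        if n > best_len:
            best_len, best_pos = n, tuple(sorted((a, b)))
    return best_len, best_pos
```

Applied to an encoded token sequence, the reported positions mark where a candidate clone starts in each of its two occurrences.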
![Page 37: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/37.jpg)
What’s been done
37
Demonstration
![Page 38: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/38.jpg)
Team Collaboration
Team 2 & Team 3
38
![Page 39: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/39.jpg)
Team Collaboration: Team 2 & Team 3
• Team 2
◦ We plan to start giving Team 3 periodic drops of our source code for Java and C# parsing.
◦ We are researching and working to update the Java and C# grammars.
• Team 3
◦ Team 3 is working on C++ parsing.
  - Looking into another parser, ELSA.
39
![Page 40: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/40.jpg)
Path Forward
Next Iteration & Schedule
40
![Page 41: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/41.jpg)
• Finalize Iteration 1 (C++ to CodeDOM)
• Iteration 2 (Code Analysis)
• Iteration 3 (Begin GUI)
Path Forward
![Page 42: Cross Language Clone Analysis Team 2 October 27, 2010.](https://reader036.fdocuments.us/reader036/viewer/2022062422/56649ece5503460f94bdbe3a/html5/thumbnails/42.jpg)
Schedule