journal for scribd

14
Research Article Computation of Program Source Code Similarity by Comp osit ion of Parse Tree and Call Graph Hyun-Je Song, Seong-Bae Park, and Se Young Park School o Computer Science and Engineering, Kyungpook National University, Daehakro, Bukgu, Daegu -, Republic o Korea Correspondence should be addressed to Seong-Bae Park; [email protected] .ac.kr Received September ; Accepted December Academic Editor: Jianjun Jiao Copyright © Hyun-Je Song et al. Tis is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproductio n in any medium, provided the original work is properly cited. Tis paper proposes a novel method to compute how similar two program source codes are. Since a program source code is represented as a structural orm, the proposed method adopts convolution kernel unctions as a similarity measure. Actually, a program source code has two kinds o structural inormation. One is syntactic inormation and the other is the dependencies o unction calls lying on the program. Since the syntactic inormation o a program is expressed as its parse tree, the syntactic simila rit y bet wee n two pr ogr amsis comput ed by a pa rse tre e ke rne l. Te uncti on cal ls wit hin a pr ogr am pr ovi de a glo bal structure o a program and can be represented as a graph. Tereore, the similarity o unction calls is computed with a graph kernel. Ten, bot h str uct ura l similarities are re ect ed simult aneously into compa rin g program source cod es by composin g the pa rsetree and the graph kernels based on a cyclomatic complexity. According to the experimental results on a real data set or program plagiarism detection, the proposed method is proved to be eective in capturing the similarity between programs. Te experiments show that the plagiarized pairs o programs are ound correctly and thoroughly by the proposed method. 1. Introduction Ma ny rea l-worl d da ta res ources such as web tab les and powerpo int templat es are usually represen ted in a struc - tural orm or providing more organized and summarized inormation. Since this type o data contains various and useul inormation in general, they are ofen an inormation source or data mining applications [ ]. However, there exist a number o duplications o the data nowadays due to the characteristics o digital data. Tis phenomenon makes data mining rom such data overloaded. Program source code is a data type that is duplicated requently like web tables and powerpoint templates. Tereore, plagiarism o program source codes becomes one o the most critical problems in education o computer science. Students can submit their program assignments by plagiarizing someone else’s work without any understanding o the subject or any reerence o the work []. However, it is highly impractical to detect all plagiarism pairs manually, especially when the size o source codes is considerably large. Tus, a method to remove such duplicates is required, and the rst step to do it is to measure the similarity b etween two source co des automatically . Program source code is one o the representative struc- tur ed data. Although a pro gra m source code look s lik e continuous strings, it is easily represented as a structural orm, that is to say , a p arse tree. A source code is represented as a tree structure afer compiling it with a grammar or a specic programming language. Tereore, the similarity bet ween pr ogr am source codes has to tak e the ir st uctures into cons idera tion. Man y prev ious stud ies have pro posed simila r- ity measures or program source code comparison, and most o them reect structural inormation to some extent [ ,  ]. Ho wev er , the re ar e some sho rtcomi ngs o the simila rit y mea - sures. First, they cannot reect entire structure o a source code, because they represent the structural inormation o the source code as structural eatures dened at lexical level. In order to overcome this problem, some studies consider the structural inormation at a structure level such as parse tree [] and un ction-cal l gr ap h []. Second, ther e is no st udy tocons ider a pars e tr ee an d a call gr aph at the same ti me even though each structure provides a dierent structural view on the program source code. A parse tree gives a relatively local structural view, while a call graph provides a high and global Hindawi Publishing Corporation Mathematical Problems in Engineering Volume 2015, Article ID 429807, 12 pages http://dx.doi.org/10.1155/2015/429807

Transcript of journal for scribd

Page 1: journal for scribd

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 113

Research ArticleComputation of Program Source Code Similarity by Composition of Parse Tree and Call Graph

Hyun-Je Song Seong-Bae Park and Se Young Park

School o Computer Science and Engineering Kyungpook National University 983096983088 Daehakro Bukgu Daegu 983095983088983090-983095983088983089 Republic o Korea

Correspondence should be addressed to Seong-Bae Park sbparksejongknuackr

Received 983091983088 September 983090983088983089983092 Accepted 983089983093 December 983090983088983089983092

Academic Editor Jianjun Jiao

Copyright copy 983090983088983089983093 Hyun-Je Song et al Tis is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Tis paper proposes a novel method to compute how similar two program source codes are Since a program source code isrepresented as a structural orm the proposed method adopts convolution kernel unctions as a similarity measure Actually aprogram source code has two kinds o structural inormation One is syntactic inormation and the other is the dependencieso unction calls lying on the program Since the syntactic inormation o a program is expressed as its parse tree the syntacticsimilarity between two programs is computed by a parse tree kernel Te unction calls within a program provide a global structureo a program and can be represented as a graph Tereore the similarity o unction calls is computed with a graph kernel Tenboth structural similarities are re1047298ected simultaneously into comparing program source codes by composing the parse tree and thegraph kernels based on a cyclomatic complexity According to the experimental results on a real data set or program plagiarism

detection the proposed method is proved to be effective in capturing the similarity between programs Te experiments show thatthe plagiarized pairs o programs are ound correctly and thoroughly by the proposed method

1 Introduction

Many real-world data resources such as web tables andpowerpoint templates are usually represented in a struc-tural orm or providing more organized and summarizedinormation Since this type o data contains various anduseul inormation in general they are ofen an inormationsource or data mining applications [983089] However there exista number o duplications o the data nowadays due to the

characteristics o digital data Tis phenomenon makes datamining rom such data overloaded Program source codeis a data type that is duplicated requently like web tablesand powerpoint templates Tereore plagiarism o programsource codes becomes one o the most critical problems ineducation o computer science Students can submit theirprogram assignments by plagiarizing someone elsersquos work without any understanding o the subject or any reerence o the work [983090] However it is highly impractical to detect allplagiarism pairs manually especially when the size o sourcecodes is considerably large Tus a method to remove suchduplicates is required and the 1047297rst step to do it is to measurethe similarity between two source codes automatically

Program source code is one o the representative struc-tured data Although a program source code looks likecontinuous strings it is easily represented as a structuralorm that is to say a parse tree A source code is representedas a tree structure afer compiling it with a grammar ora speci1047297c programming language Tereore the similarity between program source codes has to take their stuctures intoconsideration Many previous studies have proposed similar-

ity measures or program source code comparison and mosto them re1047298ect structural inormation to some extent [983091 983092]However there are some shortcomings o the similarity mea-sures First they cannot re1047298ect entire structure o a sourcecode because they represent the structural inormation o the source code as structural eatures de1047297ned at lexical levelIn order to overcome this problem some studies considerthe structural inormation at a structure level such as parsetree [983093] and unction-call graph [983094] Second there is no study to consider a parse tree and a call graph at the same time eventhough each structure provides a different structural view onthe program source code A parse tree gives a relatively localstructural view while a call graph provides a high and global

Hindawi Publishing CorporationMathematical Problems in EngineeringVolume 2015 Article ID 429807 12 pageshttpdxdoiorg1011552015429807

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 213

983090 Mathematical Problems in Engineering

level structural view Since both views are useul to detectplagiarized pairs o program source codes the similarity measure or program source code comparison should re1047298ectboth kinds o structural inormation simultaneously

Tis paper proposes a novel method to calculate thesimilarity between two program source codes Te proposed

method adopts two kinds o structural inormation based onkernel unctions A kernel unction is one o the prominentmethods or comparing structured data [983095] It can be used asa similarity measure since it calculates an inner product o two elements [983096] Te proposed method re1047298ects the syntacticstructure o a program source code using a parse tree kernel[983097] A parse tree kernelcomputesthe similarity between a pairo parse trees Tus the syntactic structural inormation o asource code is ully re1047298ected into the proposed method by it Te proposed method consider also a dynamic structureo a source code by adopting a graph kernel [983089983088] Te graphkernel in the proposed method computes the similarity valuebetween a pair o unction-call graphs Since these twokernels are instances o

-convolution kernels by Haussler

[983095] they compare trees and graphs efficiently without expliciteature enumeration respectively

Each kernel produces its own similarity based on itsown structural view Te proposed method incorporatesbothkinds o structural inormation into program source codecomparison by composing the parse tree kernel andthe graphkernel into a composite kernel Since the proposed compositekernel is based on a weighted sum composition optimizingthe weights o base kernels is a crucial issue In this paper theweights are automatically determined with the considerationo the complexity o source codes Tus i any two programsource codes are given the proposed method can computetheir similarity

Te proposed method is evaluated on source code pla-giarism detection with a real data set used in the work [983093]Our experiments show three important results First thesimilarity measure based on the parse tree kernel is morereliable than that based on the graph kernel in terms o overall perormance Second the more complicated a sourcecode the more useul the similarity based on the graphkernel in detecting plagiarized pairs Finally the proposedmethod which combines the parse tree kernel and the graphkernel detects plagiarism o real-world program source codessuccessully Tese results prove that global-level structuralinormation is an important actor or comparing programsand that the proposed similarity measure that combines

syntactic and dynamic structural inormation results in goodperormance or program source code plagiarism detection

In summary we draw the ollowing contributions in thispaper

(983089) We design and implement a source code similarity measure or plagiarism detection based on two kindso structural inormation syntactic inormation anddependencies o unction calls From the act that thechange o the structure o the source code is harderthan the one o the user-de1047297ned vocabulary theproposed method is robust or detecting plagiarismpairs

(983090) In order to make use o two kinds o structural inor-mation simultaneously we design new combinationmethod based on a complexity o source code Tismakes the proposed method work more robustly eveni we compare between complicated source codes

Te rest o the paper is organized as ollows Section 983090is devoted to related studies on program source code com-parison and program source code plagiarism detectionSection 983091 introduces the problems o source code plagiarismdetection Te similarity measure based on parse tree kerneland unctional-call graph kernel are given in Sections 983092and 983093 respectively Section 983094 proposes the composite kernelthat combines the parse tree kernel and the graph kernelSection 983095 explains the experimental settings and the resultsobtained by the proposed method Finally Section 983096 drawsthe conclusion

2 Related Work

Measuring the similarity between two objects is undamen-tal and important in many scienti1047297c 1047297elds For instancein molecular biology it is ofen required to measure thesequence similarity between protein pairs Tus many sim-ilarity measures have been proposed such as distance-basedmeasurements including Euclidean distance and Levenshteindistance mutual inormation [983089983089] inormation content usingwordNet [983089983090] and citation-based similarity [983089983091] In additionthe measures have been applied to various applications suchas inormation retrieval [983089983092 983089983093] and clustering [983089983094] as theircore part

Te similarity measure or source codes have been o interest or a long time Most early studies are based onattribute-counting similarity [983089983095 983089983096] Te similarity repre-sents a program as a vector o various elements such as thenumber o operators and operands Ten a vector similarity is normally used to detect plagiarized pairs However theperormance o this approach is relatively poor compared toother methods that consider structure o source codes sincethis approach uses only the abstract-level inormation

In order to overcome the weaknesses o the attribute-counting-metric approach some studies incorporate thestructural inormation o source code into their similarity measure In general the structure o source codes is a treeor a graph From the act that a source code is compiledinto a syntactic structure as described by the grammar o a

programming language some studies used a tree matchingalgorithm to calculate the similarity between source codes[983092 983089983097] However the algorithm represents a source codeas a string that contains certain structural inormation sothat it ails in re1047298ecting an entire structure o a source codeinto a similarity measure On the other hand some otherstudies used the knowledge that comes rom the topology o source codes Horwitz 1047297rst adopted graph structures orcomparing two programs [983091] and determined which compo-nents are changed rom one to another based on the programdependency graph [983090983088] Liu et al also used the programdependency graph to represent a sourcecode andadopted therelaxed subgraph isomorphismtesting to compare two source

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 313

Mathematical Problems in Engineering 983091

codes efficiently [983090983089] Kammer built a plagiarism detectiontool or Haskell language [983094] He extracted a call-graph 1047297rstrom a source code Te nodes in the graph are unctions andan edge indicates that one unction calls another unctionTen he transormed the graph into a tree to comparesource codes efficiently Finally he applied the

1038389lowast-based tree

edit distance and the tree isomorphism algorithm or thecomparison o source codes However this approach loses

much inormation lying on a graph since the graph is trans-ormed into a tree Lim et al proposed a method o detectingplagiarism o Java programs through analysis o 1047298ow pathso Java bytecodes [983090983090] Since a 1047298ow path o a program isa sequence o basic blocks upon execution o the programthey tried to align 1047298ow paths using a semiglobal alignmentalgorithm and then detected program source code plagiarismpairs Chae et al also tried to detect binary program(executable 1047297le) plagiarism [983090983091] Tey constructed 1047297rst A-CFG (API-labeled control 1047298ow graph) that is the unctionalabstraction o a program and then generated a vector o

a prede1047297ned dimension rom the A-CFG using MicrosofDevelopment Network Application Programming Interace(MSDN API) to avoid computational intractability Finallythey used a random walk (page-rank) algorithm to calcu-late the similarity between programs Unortunately thisapproach cannot be applied to other languages that do nothave MSDN API Recently some studies have used stopword1103925-grams [983090983092] and topic model [983090983093] to measure the similarity

Several program source code plagiarism detection toolsare available online Most o them use a string tokenizationand a string matching algorithm to measure the similarity between source codes Prechelt et al proposed JPlag It is asystem that can be used to detect plagiarism o source codeswritten in C C++ Java and Scheme [983090983094] It 1047297rst extractedtokens rom source codes and then compared tokens rom

each source code using Karp-Rabin Greedy String ilingalgorithm Another widely-used plagiarism detection systemis MOSS (MeasureO Sofware Similarity) proposed by Aiken[983090983095] Itis also based on a string-matching algorithmIt dividesprograms into 907317-grams where a 907317-gram is a contiguoussubstring o length 907317 Ten the similarity is determined by the number o same 907317-grams shared by the programs Oneo the state-o-the-art and well-known plagiarism detectionsystems is CCFinder proposed by Kamiya et al [983090983096] It usesboth attribute-counting-metric and structure inormation A

source code is transormed into a set o normalized tokensequences by its own transormation rules Te transor-mation rules are constructed manually or each languageto express structural characteristics o languages Ten thenormalized tokens are compared to reveal clone pairs insource codes Tey showed relatively good perormance andused structural inormation to some degree but it does notre1047298ect the structural inormation o source codes into itssimilarity measure ully

Te proposed method in this paper extends the kernel-based method proposed by Son et al [983093] Tey compared thestructure o source codes using a kernel unction directlyTey used the parse tree kernel especially [983097] a kind

o -convolution kernels [983095] to compare tree structureo source codes Compared to this work the proposedmethod incorporates unction-call inormation additionallyTe unctional calls are one o the important structuralinormation in comparing source codes Te main problemo Son et al is that they ocused only on syntactic structural

inormation that is local and static On the other hand theunction-call inormation provides a global view on sourcecode execution Tereore the plagiarized pairs o sourcecodes are detected more accurately by considering not only the syntactic structure but also the unction-call inormation

3 Program Source Code Plagiarism Detection

Plagiarism detection or program source codes also knownas programming plagiarism detection aims to detect plagia-rized source code pairs among a set o source codes Tesource code plagiarism detection normally consists o three

steps as illustrated in Figure 983089 Te 1047297rst step is a preprocessingstep that extracts eatures such as tokens and parse trees romsource codes Te second step calculates pair-wise similarity with the extracted eatures and a similarity measure Tere-ore the similarity values among all pairs are recorded into asimilarity matrix Finally the groups o source codes that aremost likely to be plagiarized are selected according to theirsimilarity values

Formally let be a set o source codes Plagiarismdetection aims to generate a list o plagiarized source codesbased on a similarity sim() between isin and isin I the similarity o a pair is higher than a prede1047297ned threshold

the pair is determined as a plagiarized one Tereore or

a source code isin a set o plagiarized source codes isde1047297ned as

= isin | sim ge = 1048701 (983089)

Te similarity measure sim( ) is decided by inormationtype extracted rom source codes A source code has twokinds o inormation lexical and structural inormation Lex-ical inormation corresponds to variables and the reserved

words like public i and or Tis vocabulary is composed o a large set o rarely occurring user-de1047297ned words (variables)and a small set o requently occurring words (reservedwords) On the other hand structural inormation corre-sponds to a structure that is determined by reserved wordsAmong them structural inormation is a more importantclue or detecting plagiarism since tokens can be easily con-

verted into other tokens without understanding the subjecto a source code Tereore this paper ocuses on structuralinormation Note that a source program has two kinds o structural inormation One is syntactic structure that isusually expressed as a parse tree and the other is unction-call graph structure

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 413

983092 Mathematical Problems in Engineering

Sourcecodes

Preprocessor

Similaritymatrix

Plagiarismdetectionmodule

Plagiarizedsourcecodespair

Similarity measure

Parse trees

Function call graphs

okens

Dependency graphs

F983145983143983157983154983141 983089 Te overall process o program source code plagiarism detection

expressionexpression

CompilationUnit

typeDeclaration ltEOFgt

classOrInterfaceModi1047297er

public

classDeclaration

class Fibo classBody

classBodyDeclaration

modi1047297er

classOrInterfaceModi1047297er

public

memberDeclaration

methodDeclaration

type

primitiveype

int

formalParameters

(

formalParameter

type

methodBody

modi1047297er

classOrInterfaceModi1047297er

static rFib onacci

formalParameter

number

primitiveype

int

block

blockStatement

statement

blockStatement

if parExpression statement

expression

||

( )

expression

expression expression

primary

number

primary

literal

1

expression

==expression expression

primary

number

primary

literal

2

blockStatement

statement

return statement

block

blockStatement

statement

expressionreturn

primary

literal

1

expression

(expression )

primary

rFibonacci

expressionList

-expression

primary

number

primary

literal

1

expression

(expression )

primary

rFibonacci

expressionList

-expression

primary

number

primary

literal

2

983223 983223 983223 983223 983223 983223

formalParameterList )

==

+

F983145983143983157983154983141 983090 A parse tree extracted rom the source code in Box 983089

4 Similarity Measure for Source CodesBased on Parse Tree Kernel

983092983089 Source Code as a ree Te program source code canbe naturally represented as a parse tree o which each nodedenotes variables reserved words operators and so onFigure 983090 shows an example parse tree extracted rom a Javacode in Box 983089 (this parse tree is slightly different rom theparse tree used in Son et al [983093] Tis is because a more recent

version o Java grammar is used in this paper) Te Java codein Box 983089 implements the Fibonacci sequence Due to the lack o width o paper only one unction rFibonacci is shown

in Figure 983090 while there exist 1047297ve unctions in Box 983089 Asshownin this algorithm a parse tree rom a simple source code canbe very large and deep-rooted

In this paper we use ANLR (another tool or languagerecognition) (httpwwwantlrorg) to extract a parse treerom a source code ANLR proposed by Parr and Quong isa language tool that provides a ramework or constructingrecognizers interpreters compilers and translators romgrammatical descriptions [983090983097] With ANLR and a languagegrammar a tree parser that translates a source code into aparse tree can be easily constructed

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 513

Mathematical Problems in Engineering 983093

public class Fibo public static int rFibonacci(int number)

if (number == 983089 number == 983090) return 983089

return rFibonacci(number-983089) + rFibonacci(number-983090)

private static int sum(int value983089 int value983090) return value983089 + value983090

public static int iFibonacci(int number) if (number == 983089 number == 983090)

return 983089int 1047297bo983089 = 983089 1047297bo983090 = 983089int 1047297bonacci = initOne()for (int i = 983091 i lt= number i++)

1047297bonacci = sum(1047297bo983089 1047297bo983090)1047297bo983089 = 1047297bo9830901047297bo983090 = 1047297bonacci

return 1047297bonacciprivate static int initOne()

return 983089public static void main(String[] args)

int rFibo = FiborFibonacci(983095)int iFibo = FiboiFibonacci(983095)Systemout println(rFibo)Systemout println(iFibo)

B983151983160 983089 An example o Java source code

Since parse tree has syntactic structural inormation ametric or parse tree that re1047298ects entire structural inorma-tion is required Te parse tree kernel is one o such metricsIt compares parse trees without manually designed structuraleatures

983092983090 Parse ree Kernel Parse tree kernel is a kernel that is

designed to compare tree structures such as parse trees o natural language sentences Tis kernel maps a parse treeonto a space spanned by all subtrees that can appear possibly in the parse tree Te explicit enumeration o all subtreesis computationally ineasible since the number o subtreesincreases exponentially as the size o tree grows Collins andDuffy proposed a method to compute the inner product o two trees without having to enumerate all subtrees [983097]

Let subtree1 subtree2 be all o the subtrees in a parsetree Ten can be represented as a vector

= ⟨subtree1 () subtree2 () subtree1038389 ()⟩ (983090)

where subtree1103925() is the requency o subtree1103925 in the parsetree Te kernel unction between two parse trees 1 and2 is de1047297ned as tree(1 2) = 12 and is determined as

tree 9830801 2983081 = 12= sum1103925 subtree1103925 9830801983081 sdot subtree1103925 9830802983081= sum

1103925

1048616 sum9073171isin1

subtree1038389983080110392519830811048617

sdot 1048616 sum9073172isin2

subtree1038389983080110392529830811048617

= sum9073171isin1

sum9073172isin2

98308011039251 11039252983081

(983091)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 613

983094 Mathematical Problems in Engineering

where 1and 2

are all the nodes in trees 1 and 2 Teindicator unction subtree1038389

(1103925) is 983089 i subtree1103925 is rooted at node1103925 and 983088 otherwise (11039251 11039252) is a unction which is de1047297ned as

98308011039251 11039252983081 = sum1103925

subtree103838998308011039251983081 sdot subtree1038389

98308011039252983081 (983092)

Tis unction can be calculated in polynomial time using theollowing recursive de1047297nition

(i) I the productions at 11039251 and 11039252 are different

98308011039251 11039252983081 = 0 (983093)

(ii) I both 11039251 and 11039252 are preterminals

98308011039251 11039252983081 = 1 (983094)

(iii) Otherwise the unction can be de1047297ned as ollow

98308011039251

11039252

983081 = nc(9073171)

prod1103925 9830801 + 983080ch1103925

98308011039251

983081 ch1103925

98308011039252

983081983081983081 (983095)

where nc(11039251) is the number o children o node 11039251 inthe tree

Since the productions at 11039251 and 11039252 are the same nc(11039251)is also equal to nc(11039252) Here ch1103925(11039251) denotes the th childnode o 11039251 Tis recursive algorithm is based on the act thatall subtrees rooted at a certain node can be constructed by combining the subtrees rooted at each o its children

983092983091 Modi1047297ed Parse ree Kernel Te parse tree kernel hasshown good perormance or parse trees o natural language

but it does not work well or program source code compari-son due to two issues Te 1047297rst issue is asymmetric in1047298uenceo node changes Te parse tree rom a source code tends tobe much larger and deeper than that rom a natural languagesentence Tereore the changes near root node have beenre1047298ected more ofen than the changes near lea nodes Tesecond issue is the sequence o subtrees Te original parsetree kernel counts the sequence o subtrees by consideringtheir order However the order o two substructures in asource code is meaningless in programming languages

Son et al proposed a modi1047297ed parse tree kernel to copewith these issues [983093] In order to solve the 1047297rst issue they introduced a decay actor tree and a threshold Δ that controlthe effect o large subtrees Te decay actor scales the relativeimportance o subtrees by their size As the depth o a subtreeincreases the kernel value o the subtree is penalized by (tree)size where size is the depth o the subtree In additionthe limitation o the maximum depth o possible subtrees isset as Δ so that the effect o large subtrees could be reducedTe second issue is solved by changing unction in (983095) toignore the order o two nodes

With a decay actor tree and a threshold Δ the recursiverules o the parse tree kernel is modi1047297ed as ollows

(i) I 11039251 and 11039252 are different

98308011039251

11039252

983081 = 0 (983096)

(ii) I both 11039251 and 11039252 are terminals or the current depth isequal to Δ

98308011039251 11039252983081 = tree (983097)

Equation (983095) cannot be used with these new recursive rules

since the number o child nodes can be different in 11039251 and 11039252Tus we adopt the maximum similarity between child nodesAs a result the unction in (983095) becomes

98308011039251 11039252983081 = tree

nc(9073171)prod1103925

8520081 + maxchisinch11039252

983080ch1103925 98308011039251983081 ch983081852009 (983089983088)

where ch9073172is a set o child nodes o 11039252

Te parse tree kernel with the modi1047297ed unctionmptdoes not satisy Mercerrsquos condition However many unctionsthat do not satisy Mercerrsquos condition [983091983088 983091983089] work wellin computing similarity [983091983090] Finally this parse tree kernelis used as the similarity measure sim

( ) in (983089) or syntactic

structural comparison o source codes

5 Similarity Measure for Source CodesBased on Graph Kernel

983093983089 Source Code as a Graph Recently program sourcecodes are written with object-oriented concepts and severalreactoring techniques so that the codes are getting more andmore modularized at unctional level Since a source codeencodes program logic to solve a problem the execution 1047298ow at unction level is one o the important actors to identiy the source code Tereore this unction-level 1047298ow should beconsidered to compare source codes

One possible representation or the unction-level 1047298ow isa unction-call graph which represents dependencies amongunctions within a program Let be a source code Ten a

unction calls graph = ( ) is a directed graph extractedrom where V isin is a unction in Tus || is the numbero unctions in is a set o edges and each edge represents adependency relation between unctions Tat is an edge 1103925 isin that connects nodesV 1103925 and V implies that a unction V 1103925 callsanother unction V Te weight o an edge is given by

1103925 = 9831631 i V 1103925 calls V 0 otherwise (983089983089)

Figure 983091 illustrates an example unction-call graphextracted rom the Java code in Box 983089 Tis code contains 1047297veunctions including main First main calls rFibonacciand iFibonacci in order Since rFibonacci is arecursive unction it calls itsel iFibonacci calls twounctions initOne and sum to initialize a variable and geta sum Finally main calls println to print out the results

A rule-based approach is adopted to extract a call graphrom a source code Simple rules are used to 1047297nd thecaller-callee relationship rom a parse tree For instance inJava a rule ldquoif lsquoexpression (expresionList)rsquo is oundthen expression is a called unction namerdquo is used to1047297nd subtrees rom a parse tree Ten unction names and

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 713

Mathematical Problems in Engineering 983095

rFibonacci

iFibonacci

main

initOne

sum

println

F983145983143983157983154983141 983091 Example o extracted call graph rom Java code

parameters are extracted rom the matched subtrees Tenodes or the extracted unction names are connected to thecaller node

Function-call graph o a program depicts how the pro-gram executes at unction level and how unctions are relatedto one another Since the 1047298ow o a program is quite uniqueaccording to the task the similarity between two sources canbe calculated using the 1047298ows o the programs Since this 1047298ow is represented as a unction-call graph the graph kernel is thebest method to compare unction-call graphs It showed goodperormance in several 1047297elds including biology and social

network analysis

983093983090 Graph Kernel Graph kernel is a kernel thatis designed tocompare graph structures Like the parse tree kernel a graphis mapped onto a eature space spanned by their subgraphs inthe graph kernel Te intuitive eature o the graph kernel isgraph isomorphism that determines the topological identityAccording to Gartner et al [983089983088] however it is as hard asdeciding whether two graphs are isomorphic to compute any complete graph kernel with an injective mapping unctionor all graphs where graph isomorphism is a NP-completeproblem [983091983091] Tus most graph kernels ocus on alternativeeature representation o graphs

Te random walk graph kernel is one o the most widely used graph kernels It uses all possible random walks aseatures or graphs Let be a set o all possible random walksandW907317() denotes the set o all possible walks with 1103925 edgesin graph For each random walk isin whose length is 1103925the corresponding eature mapping unction o a graph isgiven as

Φ () = radic907317graph1048699 isinW907317 () forall 9830801103925983081 = 9830801103925983081983165 (983089983090)

where 907317graph is a weight or the length 1103925 and (1103925) and (1103925)are the

th label o the random walk

and

respectively

Te kernel unction between two graphs 1 and 2 denotedby graph(1 2) can be de1047297ned as

graph 9830801 2983081 = sum

Φ 9830801983081 sdotΦ 9830802983081 (983089983091)

Gartner et al proposed an approach to calculate all ran-domwalks within two graphswithout explicit enumeration o all random walks [983089983088] A direct product graph o two graphs1 = (1 1) and 2 = (2 2) denoted by 1 times 2 =(times times) where times is its node set and times is its edge set isde1047297ned as ollows

times 9830801 times 2983081 = 1048699983080V 1 1983081 isin 1 times 2 | 983080V 1983081 = 9830801983081983165 times 9830801 times 2983081 = 1048699983080983080V 1 1983081 983080V 2 2983081983081 isin times 9830801 times 2983081 |

983080V 1 V 2983081 isin 1 9830801 2983081 isin 2

983080V 1

V 2

983081 = 9830801

2

983081983165

(983089983092)

where (V ) is the label o a node V and ( ) is the label o an edge between node and node Based on the directproduct graph the random walk kernel can be calculated

Let 1038389times isin R|times|times|times| denote an adjacency matrix o the directproduct1 times2 With a weighting actorgraph ge 0graph in(983089983091) can be rewritten as

graph 9830801 2983081 = |times|sum1103925=1

1048667infinsum907317=0

907317graph1038389907317times1048669

1103925

(983089983093)

Tis random walk kernel can be computed in (11039253

) usingSylvester equation or Conjugate Gradient method where 1103925 isthe number o nodes [983091983092]

983093983091 Modi1047297ed Graph Kernel When a graph kernel is used tocompare source codes goodperormance is not expected dueto theact that the graph kernel measures similaritiesbetweenwalks with an identical label Since the labels (unctionnames) o nodes within the unction-call graph are decidedby human developers they are seldom identical even i thesource codes are simple Tereore the graph kernel has toconsider nonidentical labels

Borgwardt et al modi1047297ed the random walk kernel to

compare nonidentical labels by changing the direct productgraph to include all pairs o nodes and edges [983091983093] Assumethat nodes are compared by a node kernelnode and edges arecompared by an edge kernel kedge edge Tat is node(V )calculates the similarity between two labels rom the nodes V

and andedge((V 1103925 V 1103925+1)(1103925 1103925+1)) computes the similarity between two edges (V 1103925 V 1103925+1) and (1103925 1103925+1) With these twokernels the random walk kernel between two unction callgraphs 1 and 2 is now de1047297ned as

mg 9830801 2983081 = 907317minus1sum1103925=1

step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081 (983089983094)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 813

983096 Mathematical Problems in Engineering

where

step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081= node 983080V 1103925 1103925983081 sdot node 983080V 1103925+1 1103925+1983081

sdot edge

983080983080V 1103925

V 1103925+1

983081 9830801103925

1103925+1

983081983081

(983089983095)

I this modi1047297ed random walk kernel is used or thecomparison o source code the node kernel node and theedge kernel edge should be de1047297ned Note that the labelso edges in unction-call graph are binary values by (983089983089)Tus edge is simply designed to compare binary valuesTe simplest orm or node(V ) is a unction that returns983089 when V and have similar string patterns 983088 otherwiseTat is it returns 983089 i a distance between V and is smallerthan a prede1047297ned threshold In this paper we simply useLevenshtein distance as the distance and set the threshold as983088983093

Te modi1047297ed random walk kernel mg can be alsocomputed using (983089983093) However the adjacency matrix

1038389times o

the direct product 1 times 2 should be modi1047297ed as

9831311038389times983133(V 10383891038389)(V 907317907317)

= 852091step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081

i 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081 isin times0 otherwise(983089983096)

wheretimes isashortormo times(1times2) and the edges (V 1103925 V 1103925+1)and (1103925 1103925+1) belong to 1 and 2 respectively As in parsetree kernel this modi1047297ed graph kernel is used to comparesource codes as similarity measure sim( ) in (983089)

6 Similarity Measure for Source CodesBased on a Composite Kernel

Te modi1047297ed parse tree kernel manages syntactic structuralinormation whereas the modi1047297ed graph kernel considershigh-level topological inormation o source codes In orderto make use o both kinds o inormation the compositiono the two kernels is required Cristianini and Shawe-aylorproved that a new kernel can be obtained by combiningexisting several kernels with some closure properties suchas weighted sum and multiplication [983091983094] Among variousclosure properties this paper adopts the weighted sum sinceit is simple and widely used

Beore combining two kernels the kernels should benormalized since the modi1047297ed parse tree kernel mpt andthe modi1047297ed graph kernelmg are not bound Tereore onekernel can dominate other in their composition In order toremove this effect the kernels are 1047297rst normalized When akernel() is given itsnormalized one()isde1047297nedas

= radic ( ) sdot 983080 983081 (983089983097)

Tereore () is bounded between 983088 and 983089 Tat is 0 le

(

) le 1

Our composite kernelco is composed o the normalizedmodi1047297ed parse tree kernel

mpt and the normalized modi1047297ed

graph kernelmg Tatis the composite kernel

co or given

two source codes and is de1047297ned as

co

= 9830801 minus 983081 sdot mpt

983080

983081 + sdot mg

983080

983081 (983090983088)

where is a mixing weight between two kernels and areparse trees extracted rom source codes and respectivelyand and are call graphs rom and respectively Telarger is the more signi1047297cant mpt is On the other handas the value o gets small the graph kernel mg is moresigni1047297cant than the parse tree kernel mpt Tis compositekernel is used as our 1047297nal similarity measure sim( ) in (983089)

Te parse tree kernel compares source codes with local-level view since it is based on subtree comparison Most pla-giarized source codes change a small portion o the originalsource code Tus the parse tree kernel has shown goodperormances in general However it does not re1047298ect the 1047298ow o the program which is dynamic structural inormationTegraph kernel on the other hand calculates the similarity interms o dynamic high-level view Tus when source codesconsist o a number o unctions the graph kernel achievesreasonable perormance As a result should be determinedby the complexity o source codes since it is a parameter tocontrol the relative importance between the parse tree kerneland the graph kernel

Tere are many methodsthat measure the complexity o asource code One widely-used method is the cyclomatic com-plexity proposed by McCabe [983091983095] Te cyclomatic complexity is a graph-theoretic quantitative metric and measures thenumber o paths within a source code It is simply calculated

using a control 1047298ow graph o a source code where the nodeso the graph correspond to entities o the source code andan (directed) edge between two nodes implies a dependency relation between entities Given the control 1047298ow graph o asource code () the cyclomatic complexity o source code is de1047297ned as

() = minus + 2 (983090983089)

where is the number o edges o the graph is the numbero nodes and is the number o connected components Telarger is the more complicated source code is

In this paper we measure the complexity o a sourcecode using itsunction-call graph Since a unction-call graphrepresents dependencies among unctions within a programit can be considered as a kind o control 1047298ow graphs whereentities o the source code are the unction in the source codeand the edge implies dependencies between unctions

Let() and() be the cyclomatic complexities o twosource codes and respectively Since is the weight o two normalized kernels it has to normalize between 983088 and 983089Te sigmoid unction is de1047297ned or all real input values andreturns a positive value between 983088 and 983089 Tus the sigmoidunction is adopted or o (983090983088) and is de1047297ned as

= 1

1 + minus(min(()())minus25)

(983090983090)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 913

Mathematical Problems in Engineering 983097

983137983138983148983141 983089 Simple statistics on the real data set

Inormation Value

Number o total assignments 983091983094

Number o submitted source codes 983093983093983093

Average number o submitted codes per assignment 983089983093983092983090

Minimum number o lines in source code 983092983097Maximum number o lines in source code 983090983096983094983091

Average number o lines per source code 983091983088983093983088983095

Minimum number o nodes in source code 983089983090

Maximum number o nodes in source code 983092983092983095

Average number o nodes in source code 983094983092983090983097

Number o marked plagiarism pairs 983089983095983093

where min() returns the minimum value between and According to (983090983090) as the cyclomatic complexity gets larger also increases is to be 983088983093 when thecyclomatic complexity o source code is 983090983093 Tis indicates

that when the cyclomatic complexity o source code is983090983093 the parse tree kernel and the graph kernel have anequal importance in the composite kernel A number o source code analysis applications regard source codes whosecyclomatic complexity is more than 983090983093 as complicated codes(httpmsdnmicrosofcomen-uslibraryms983089983096983090983090983089983090aspx )Tus we set 983090983093 as the equal point o the importance betweenthe parse tree kernel and the graph kernel

7 Experiments

983095983089 Experimental Settings For experiments the same dataset in the work o Son et al [983093] is used Tis data setis collected rom actual programming assignments o Javaclasses submitted by undergraduate students rom 983090983088983088983093 to983090983088983088983097 able 983089 shows simple statistics o the data set Te totalnumber o programming assignments is 983091983094 and the numbero submitted source codes is 983093983093983093 or the 983091983094 assignments Tusthe average number o source codes per an assignment is983089983093983092983090

Figure 983092 shows the histogram o the source codes perlines Te -axis is the number o program lines and the -axis represents the number o source codes As shown in this1047297gure about 983095983093 o source codes are written with less than983092983088983088 lines Te minimum number o lines o a source code is983092983097 and the maximum is 983090983096983094983091 Te average number o lines

per code is 983091983088983093983088983095In our data set the minimum number o unctions within

a program is 983089983090 whereas the maximum number is 983092983092983095Te programs with larger number o programs are paintprograms with any buttons In the paint programs studentsare required to set a layout manually with raw unctionssuch assetBounds Tus paint programs have a number o unctions Te average number o unctions is 983094983092983090983095

wo annotators created the gold standard or this dataset Tey investigated all sourcecodes and markedplagiarizedpairs manually In order to measurethe reliability andvalidity o the annotators Cohenrsquos kappa agreement [983091983096] is measuredTe kappa agreement o the annotators is

= 093 whichalls

180

160

140

120

100

80

60

40

20

00 500 1000 1500 2000 2500 3000

Source code lines

N u m

b e r o f s

o u r c e c o

d e s

F983145983143983157983154983141 983092 Histogram o source lines per code

on the category o ldquoalmost perect agreementrdquo Only the pairs judged as plagiarized pairs by both annotators are regardedas actual plagiarized pairs In total 983089983095983093 pairs are marked asplagiarized pairs

Tree metrics are used as evaluation measure precisionrecall and 1-measure Tey are calculated as ollows

Precision = o correctly detected plagiarized pairs

o detected plagiarized pairs

Recall = o correctly detected plagiarized pairs

o true plagiarized pairs

1-measure = 2 sdot precision

sdot recall

precision + recall (983090983091)

In order to evaluate the proposed method severalbaseline systems are used It is compared with JPlag andCCFinder In all experiments or the parse tree kernel thethreshold or subtree depth Δ is set as 983091 and the decay actortree is 983088983089 Te decay actor graph or graph kernel is set tobe 983088983089 empirically in (983090983089) is set to be 983089 because each sourcecode in our data set is a single program

983095983090 Experimental Results Beore evaluating the perormanceo plagiarism detection we 1047297rst examine relatedness between

the number o source code lines and the cyclomatic complex-ity Tis examination tries to show that (983090983090) is easible Since is determined with cyclomatic complexity it is expected to beproportionalto cyclomatic complexity Figure 983093 shows scatterplot between the number o lines and cyclomatic complexityAs shown in this 1047297gure they are highly correlated with eachother in our data set Te Pearson correlation coefficient is983088983095983089983092 Tis result implies that it is easible to set in (983090983090) to beproportional to cyclomatic complexity

In order to see the effect o threshold in (983089) in ourmethod the perormances are measured according to the

values o Figure 983094 shows the perormance o the pro-posed method or various

As

increases the precision

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1013

983089983088 Mathematical Problems in Engineering

600

500

400

300

200

100

0

C y c

l o m a t i c

c o m p l e x i t y

0 500 1000 1500 2000 2500 3000

Number of lines

F983145983143983157983154983141 983093 Scatter plot according to the number o lines o sourcecodes and the corresponding cyclomatic compelxity

10

09

08

07

06

05

086 088 092090 094 096 098

RecallPrecision

F -measure

Te value of threshold

F983145983143983157983154983141 983094 Perormance o the proposed system or real-world dataset

also increases while the recall decreases slightly Te bestperormance is achieved at = 096 with 983088983096983095 o 1-measureTus

= 096 is used at all the experiments below

Figure 983095 compares the proposed method with variouskernels according to the number o source code lines In this1047297gure the -axis is the number lines o source code andthe -axis represent the average 1-measure As shown inthis 1047297gure the original graph kernel shows the worst per-ormance Since it uses only the graph structure o unctioncalls it ofen ails in calculating the similarity among sourcecodes For example assume that there are two source codesIn one source code main calls a unction add and addcalls another unction multiply In the other source code main calls multiply and multiply calls add Tese twosource codes are the same under the graph kernel since thelabel inormation is ignored by the graph kernel Without

Graph kernel

Parse tree kernel

Modi1047297ed graph kernel

Proposed method

10

08

06

04

02

00

A v e r a g e F 1

- m e a s u r e

lt100 lt200 lt300 lt400 gt400

Number of lines

F983145983143983157983154983141 983095 Average 1-measure according to the number o source

code lines

the labels these two graphs are identical On the other handthe modi1047297ed graph kernel utilizes the label inormation As aresult it achieves better perormance than the graph kernel

Te parse tree kernel achieves higher perormance thanother methods or the source codes with less than 983091983088983088 linesWhen the number o lines in source codes is small theplagiarized codes are ofen made by changing the originalone locally Tus the parse tree kernel detects plagiarizedpairs accurately or the codes with small number o linesWhen source codes have more than 983091983088983088 lines the modi1047297edgraph kernel shows slightly better perormance than theparse tree kernel Tis result implies that high-level structuralinormation is another actor to compare (large) source codesand the modi1047297ed graph kernel can re1047298ect this structuralinormation well

Te proposed method that combines the parse tree kerneland the modi1047297ed graph kernel achieves the best perormanceor all source codes except those with 300 sim 400 linesSince the cyclomatic complexity o source codes with 300 sim400 lines is near 983090983093 the proposed method re1047298ects the parsetree kernel and the modi1047297ed graph kernel equally Tusit achieves an average perormance o the kernels By thecyclomatic complexity o source codes the proposed method

is more in1047298uenced by the parse tree kernel when a sourcecode is small I a source code is large the effect o graphkernel is larger than that o the parse tree kernel From theresults it can be concluded that the proposed method doesnot consider only local-level structural inormation but alsohigh-level structural inormation effectively

Te 1047297nal 1-measure o program source code plagiarismdetection is given in able 983090 Te proposed method showsthe best 1-measure compared to other kernels or opensource plagiarism systems Te difference o 1-measure is983088983090983097 against JPlag 983088983089983095 against CCFinder 983088983088983096 against themodi1047297ed graph and 983088983088983093 against the modi1047297ed parse treekernel Tis result implies that or source code plagiarism

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1113

Mathematical Problems in Engineering 983089983089

983137983138983148983141 983090 Final 1-measure o plagiarism detection

Method 1-measure

JPlag 983088983093983094

CCFinder 983088983095983088

Modi1047297ed parse tree kernel ( = 095) [983093] 983088983096983092

Graph kernel ( = 099) 983088983092983093Modi1047297ed graph kernel ( = 097) 983088983096983090

Proposed method ( = 096) 983088983096983095

detection the similarity measure sim in (983089) should considernot only the syntactic structural inormation but also thedynamic call structure simultaneously

8 Conclusion

In this paper we have proposed a novel method or programsource code comparison Te proposed method calculatesthe

similarity between two source codes with the composition o two kinds o structural inormation extractedrom the sourcecodes Tat is the method uses both syntactic inormationand dynamic inormation Te syntactic inormation whichprovides local-level structural view is included in the parsetree In order to compare the parse trees this paper adoptsa specialized tree kernel or parse trees o source codes Tedynamic inormation which is contained in the unction-callgraph gives high and global level structural view Te graphkernel with the consideration unction names is adopted tore1047298ect the graph structure Finally the proposed methoduses a composite kernel o the kernels to use both kindso inormation In addition the weights o the kernels in

the composite kernel are automatically determined with thecyclomatic complexity

In the experiments o Java program source code plagia-rism detection with real data set it is shown that the proposedmethod outperormed existing methods in detecting plagia-rized pairs In particular the experiments with the variousnumber o lines show that the proposed method alwaysworkswell regardless o the size o source codes

One advantage o the proposed method is that it canbe used with other languages such as C C++ and Pythoneven i the experiments were only conducted with JavaSince the proposed method requires only parse trees andunction-call graphs o source codes it can be applied to any

other languages i a parser or the languages is available Allkinds o inormation o the proposed method are available athttpmlknuackrplagiarism

Conflict of Interests

Te authors declare that there is no con1047298ict o interestsregarding the publication o this paper

Acknowledgments

Tis study was supported by the BK983090983089 Plus project (SWHuman Resource Development Program or Supporting

Smart Lie) unded by the Ministry o Education Schoolo Computer Science and Engineering Kyungpook NationalUniversity Korea (983090983089A983090983088983089983091983089983094983088983088983088983088983093) and by IC RampD pro-gram o MSIPIIP (983089983088983088983092983092983092983097983092 WiseKB Big data based sel-evolving knowledge base and reasoning platorm)

References

[983089] J-W Son and S-B Park ldquoWeb table discrimination with com-position o rich structural and content inormationrdquo Applied Sof Computing vol 983089983091 no 983089 pp 983092983095ndash983093983095 983090983088983089983091

[983090] DL McCabe ldquoCheating among college and university studentsa north American perspectiverdquo International Journal or Educa-tional Integrity vol 983089 no 983089 pp 983089ndash983089983089 983090983088983088983093

[983091] S Horwitz ldquoIdentiying the semantic and textual differencesbetween two versions o a programrdquo in Proceedings o the ACM SIGPLAN Conerence on Programming Language Design and Implementation pp 983090983091983092ndash983090983092983093 983089983097983097983088

[983092] W Yang ldquoIdentiying syntactic differences between two pro-

gramsrdquo Sofware Practice and Experience vol 983090983089 no 983095 pp 983095983091983097ndash983095983093983093 983089983097983097983089

[983093] J-W Son -G Noh H-J Song and S-B Park ldquoAn applicationor plagiarized source code detection based on a parse treekernelrdquo Engineering Applications o Arti1047297cial Intelligencevol983090983094no 983096 pp 983089983097983089983089ndash983089983097983089983096 983090983088983089983091

[983094] M L Kammer Plagiarism detection in haskell programs using call graph matching [MS thesis] Utrecht University 983090983088983089983089

[983095] D Haussler ldquoConvolution kernels on discrete structuresrdquo echRep UCS-CRL-983097983097-983089983088 University o Caliornia Santa CruzCali USA 983089983097983097983097

[983096] B Scholkop K suda and J-P Vert Kernel Methods inComputational Biology MI Press 983090983088983088983092

[983097] M Collins and N Duffy ldquoConvolution kernels or natural lan-guagerdquo in Advances in Neural Inormation Processing Systemspp 983094983090983093ndash983094983091983090 983090983088983088983089

[983089983088] Gartner P Flach and S Wrobel ldquoOn graph kernels hardnessresults and efficient alternativesrdquo in Proceedings o the 983089983094th Annual Conerence on Learning Teory pp 983089983090983097ndash983089983092983091 August983090983088983088983091

[983089983089] D Hindle ldquoNoun classi1047297cation rom predicate-argument struc-turesrdquo in Proceedings o the 983090983096th Annual Meeting on Association or Computational Linguistics (ACL rsquo983097983088) pp 983090983094983096ndash983090983095983093 Strouds-burg Pa USA June 983089983097983097983088

[983089983090] P Resnik ldquoUsing inormation content to evaluate semanticsimilarity in a taxonomyrdquo in Proceedings o the983089983091th International Joint Conerence on Arti1047297cial Intelligence pp 983092983092983096ndash983092983093983091 983089983097983097983093

[983089983091] B Gipp N Meuschke and C Breitinger ldquoCitation-based pla-giarism detection practicability on a large-scale scienti1047297c cor-pusrdquo Journalo the Association or Inormation Science and ech-nology vol 983094983093 no 983096 pp 983089983093983090983095ndash983089983093983092983088 983090983088983089983092

[983089983092] G Varelas E Voutsakis P Rafopoulou E G Petrakis andE E Milios ldquoSemantic similarity methods in wordnet andtheir application to inormation retrieval on the webrdquo in Pro-ceedings o the 983095th Annual ACM International Workshop on WebInormation and Data Management pp 983089983088ndash983089983094 983090983088983088983093

[983089983093] K Williams H-H Chen andC L GilesldquoClassiying and rank-ing search engine results as potential sources o plagiarismrdquo inProceedings o the ACM Symposium on Document Engineering pp 983097983095ndash983089983088983094 Fort Collins Colo USA September 983090983088983089983092

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1213

983089983090 Mathematical Problems in Engineering

[983089983094] R A Jarvis and E A Patrick ldquoClustering using a similarity measure based on shared near neighborsrdquo IEEE ransactions onComputers vol 983090983090 no 983089983089 pp 983089983088983090983093ndash983089983088983091983092 983089983097983095983091

[983089983095] K J Ottenstein ldquoAn algorithmic approach to the detection andprevention o plagiarismrdquo ACM SIGCSE Bulletin vol 983096 no 983092pp 983091983088ndash983092983089 983089983097983095983094

[983089983096] M Halstead Elements o Sofware Science Elsevier 983089983097983095983095[983089983097] I D Baxter A Yahin L Moura M SantrsquoAnna and L Bier

ldquoClone detection using abstract syntax treesrdquo in Proceedingso the IEEE International Conerence on Sofware Maintenance(ICSM rsquo983097983096) pp 983091983094983096ndash983091983095983095 November 983089983097983097983096

[983090983088] J Ferrante K J Ottenstein and J D Warren ldquoTe programdependence graph and its use in optimizationrdquo ACM ransac-tions on Programming Languages and Systems vol 983097 no 983091 pp983091983089983097ndash983091983092983097 983089983097983096983095

[983090983089] C Liu C Chen J Han and P S Yu ldquoGplag detection o sofware plagiarism by program dependence graph analysisrdquo inProceedings o the 983089983090th ACM SIGKDD International Conerenceon Knowledge Discovery and Data Mining pp 983096983095983090ndash983096983096983089 983090983088983088983094

[983090983090] H-I Lim H Park S Choi and Han ldquoA method or detecting

the thef o Java programs through analysis o the control 1047298ow inormationrdquo Inormation and Sofware echnology vol 983093983089 no983097 pp 983089983091983091983096ndash983089983091983093983088 983090983088983088983097

[983090983091] D-K Chae J Ha S-W Kim B J Kang and E G ImldquoSofware plagiarism detection a graph-based approachrdquo inProceedings o the 983090983090nd ACM International Conerence onInormation amp Knowledge Management (CIKM rsquo983089983091) pp 983089983093983095983095ndash983089983093983096983088 Burlingame Cali USA November 983090983088983089983091

[983090983092] E Stamatatos ldquoPlagiarism detection using stopword n-gramsrdquo Journal o the American Society or Inormation Science and echnology vol 983094983090 no 983089983090 pp 983090983093983089983090ndash983090983093983090983095 983090983088983089983089

[983090983093] G Cosma and M Joy ldquoAn approach to source-code plagiarismdetection and investigation using latent semantic analysisrdquo

IEEE ransactions on Computers vol 983094983089 no 983091 pp 983091983095983097ndash983091983097983092983090983088983089983090

[983090983094] L Prechelt G Malpohl and M Philippsen ldquoFinding plagia-risms among a set o programs with jplagrdquo Journal o Universal Computer Science vol 983096 no 983089983089 pp 983089983088983089983094ndash983089983088983091983096 983090983088983088983090

[983090983095] A Aiken ldquoMoss a system or detecting sofware plagiarismrdquo983089983097983097983096 httptheorystanordedusimaikenmoss

[983090983096] Kamiya S Kusumoto and K Inoue ldquoCCFinder a multilin-guistic token-based code clone detection system or large scalesource coderdquo IEEE ransactions on Sofware Engineering vol983090983096 no 983095 pp 983094983093983092ndash983094983095983088 983090983088983088983090

[983090983097] J Parr and R W Quong ldquoANLR a predicated-LL(k) parsergeneratorrdquo Sofware Practice and Experience vol 983090983093 no 983095 pp983095983096983097ndash983096983089983088 983089983097983097983093

[983091983088] V N Vapnik Te Nature o StatisticalLearningTeory SpringerNew York NY USA 983089983097983097983093

[983091983089] R Courant and D Hilbert Methods o Mathematical PhysicsInterscience New York NY USA 983089983097983093983091

[983091983090] A Moschitti and F M Zanzotto ldquoFast and effective kernelsor relational learning rom textsrdquo in Proceedings o the 983090983092thInternational Conerence on Machine Learning (ICML rsquo983088983095) pp983094983092983097ndash983094983093983094 Corvallis Ore USA June 983090983088983088983095

[983091983091] M R Garey and D S Johnson Computers and Intractability AGuide to the Teory o NP-Completeness W H Freeman 983089983097983097983088

[983091983092] S V N Vishwanathan N N Schraudolph R Kondor and KM Borgwardt ldquoGraph kernelsrdquo Journal o Machine Learning Research vol 983089983089 pp 983089983090983088983089ndash983089983090983092983090 983090983088983089983088

[983091983093] K M Borgwardt C S Ong S Schonauer S V N Vish-wanathan A J Smola and H-P Kriegel ldquoProtein unction pre-diction via graph kernelsrdquo Bioinormatics vol 983090983089 supplement 983089pp i983092983095ndashi983093983094 983090983088983088983093

[983091983094] N Cristianini and J Shawe-aylor An Introduction to Support Vector Machines Cambridge University Press CambridgeUK983090983088983088983088

[983091983095] J McCabe ldquoA complexity measurerdquo IEEE ransactions onSofware Engineering vol 983090 no 983092 pp 983091983088983096ndash983091983090983088 983089983097983095983094

[983091983096] J Carletta ldquoAssessing agreement on classi1047297cation tasks thekappa statisticrdquo Computational Linguistics vol 983090983090 no 983090 pp983090983092983097ndash983090983093983092 983089983097983097983094

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1313

Submit your manuscripts at

httpwwwhindawicom

Page 2: journal for scribd

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 213

983090 Mathematical Problems in Engineering

level structural view Since both views are useul to detectplagiarized pairs o program source codes the similarity measure or program source code comparison should re1047298ectboth kinds o structural inormation simultaneously

Tis paper proposes a novel method to calculate thesimilarity between two program source codes Te proposed

method adopts two kinds o structural inormation based onkernel unctions A kernel unction is one o the prominentmethods or comparing structured data [983095] It can be used asa similarity measure since it calculates an inner product o two elements [983096] Te proposed method re1047298ects the syntacticstructure o a program source code using a parse tree kernel[983097] A parse tree kernelcomputesthe similarity between a pairo parse trees Tus the syntactic structural inormation o asource code is ully re1047298ected into the proposed method by it Te proposed method consider also a dynamic structureo a source code by adopting a graph kernel [983089983088] Te graphkernel in the proposed method computes the similarity valuebetween a pair o unction-call graphs Since these twokernels are instances o

-convolution kernels by Haussler

[983095] they compare trees and graphs efficiently without expliciteature enumeration respectively

Each kernel produces its own similarity based on itsown structural view Te proposed method incorporatesbothkinds o structural inormation into program source codecomparison by composing the parse tree kernel andthe graphkernel into a composite kernel Since the proposed compositekernel is based on a weighted sum composition optimizingthe weights o base kernels is a crucial issue In this paper theweights are automatically determined with the considerationo the complexity o source codes Tus i any two programsource codes are given the proposed method can computetheir similarity

Te proposed method is evaluated on source code pla-giarism detection with a real data set used in the work [983093]Our experiments show three important results First thesimilarity measure based on the parse tree kernel is morereliable than that based on the graph kernel in terms o overall perormance Second the more complicated a sourcecode the more useul the similarity based on the graphkernel in detecting plagiarized pairs Finally the proposedmethod which combines the parse tree kernel and the graphkernel detects plagiarism o real-world program source codessuccessully Tese results prove that global-level structuralinormation is an important actor or comparing programsand that the proposed similarity measure that combines

syntactic and dynamic structural inormation results in goodperormance or program source code plagiarism detection

In summary we draw the ollowing contributions in thispaper

(983089) We design and implement a source code similarity measure or plagiarism detection based on two kindso structural inormation syntactic inormation anddependencies o unction calls From the act that thechange o the structure o the source code is harderthan the one o the user-de1047297ned vocabulary theproposed method is robust or detecting plagiarismpairs

(983090) In order to make use o two kinds o structural inor-mation simultaneously we design new combinationmethod based on a complexity o source code Tismakes the proposed method work more robustly eveni we compare between complicated source codes

Te rest o the paper is organized as ollows Section 983090is devoted to related studies on program source code com-parison and program source code plagiarism detectionSection 983091 introduces the problems o source code plagiarismdetection Te similarity measure based on parse tree kerneland unctional-call graph kernel are given in Sections 983092and 983093 respectively Section 983094 proposes the composite kernelthat combines the parse tree kernel and the graph kernelSection 983095 explains the experimental settings and the resultsobtained by the proposed method Finally Section 983096 drawsthe conclusion

2 Related Work

Measuring the similarity between two objects is undamen-tal and important in many scienti1047297c 1047297elds For instancein molecular biology it is ofen required to measure thesequence similarity between protein pairs Tus many sim-ilarity measures have been proposed such as distance-basedmeasurements including Euclidean distance and Levenshteindistance mutual inormation [983089983089] inormation content usingwordNet [983089983090] and citation-based similarity [983089983091] In additionthe measures have been applied to various applications suchas inormation retrieval [983089983092 983089983093] and clustering [983089983094] as theircore part

Te similarity measure or source codes have been o interest or a long time Most early studies are based onattribute-counting similarity [983089983095 983089983096] Te similarity repre-sents a program as a vector o various elements such as thenumber o operators and operands Ten a vector similarity is normally used to detect plagiarized pairs However theperormance o this approach is relatively poor compared toother methods that consider structure o source codes sincethis approach uses only the abstract-level inormation

In order to overcome the weaknesses o the attribute-counting-metric approach some studies incorporate thestructural inormation o source code into their similarity measure In general the structure o source codes is a treeor a graph From the act that a source code is compiledinto a syntactic structure as described by the grammar o a

programming language some studies used a tree matchingalgorithm to calculate the similarity between source codes[983092 983089983097] However the algorithm represents a source codeas a string that contains certain structural inormation sothat it ails in re1047298ecting an entire structure o a source codeinto a similarity measure On the other hand some otherstudies used the knowledge that comes rom the topology o source codes Horwitz 1047297rst adopted graph structures orcomparing two programs [983091] and determined which compo-nents are changed rom one to another based on the programdependency graph [983090983088] Liu et al also used the programdependency graph to represent a sourcecode andadopted therelaxed subgraph isomorphismtesting to compare two source

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 313

Mathematical Problems in Engineering 983091

codes efficiently [983090983089] Kammer built a plagiarism detectiontool or Haskell language [983094] He extracted a call-graph 1047297rstrom a source code Te nodes in the graph are unctions andan edge indicates that one unction calls another unctionTen he transormed the graph into a tree to comparesource codes efficiently Finally he applied the

1038389lowast-based tree

edit distance and the tree isomorphism algorithm or thecomparison o source codes However this approach loses

much inormation lying on a graph since the graph is trans-ormed into a tree Lim et al proposed a method o detectingplagiarism o Java programs through analysis o 1047298ow pathso Java bytecodes [983090983090] Since a 1047298ow path o a program isa sequence o basic blocks upon execution o the programthey tried to align 1047298ow paths using a semiglobal alignmentalgorithm and then detected program source code plagiarismpairs Chae et al also tried to detect binary program(executable 1047297le) plagiarism [983090983091] Tey constructed 1047297rst A-CFG (API-labeled control 1047298ow graph) that is the unctionalabstraction o a program and then generated a vector o

a prede1047297ned dimension rom the A-CFG using MicrosofDevelopment Network Application Programming Interace(MSDN API) to avoid computational intractability Finallythey used a random walk (page-rank) algorithm to calcu-late the similarity between programs Unortunately thisapproach cannot be applied to other languages that do nothave MSDN API Recently some studies have used stopword1103925-grams [983090983092] and topic model [983090983093] to measure the similarity

Several program source code plagiarism detection toolsare available online Most o them use a string tokenizationand a string matching algorithm to measure the similarity between source codes Prechelt et al proposed JPlag It is asystem that can be used to detect plagiarism o source codeswritten in C C++ Java and Scheme [983090983094] It 1047297rst extractedtokens rom source codes and then compared tokens rom

each source code using Karp-Rabin Greedy String ilingalgorithm Another widely-used plagiarism detection systemis MOSS (MeasureO Sofware Similarity) proposed by Aiken[983090983095] Itis also based on a string-matching algorithmIt dividesprograms into 907317-grams where a 907317-gram is a contiguoussubstring o length 907317 Ten the similarity is determined by the number o same 907317-grams shared by the programs Oneo the state-o-the-art and well-known plagiarism detectionsystems is CCFinder proposed by Kamiya et al [983090983096] It usesboth attribute-counting-metric and structure inormation A

source code is transormed into a set o normalized tokensequences by its own transormation rules Te transor-mation rules are constructed manually or each languageto express structural characteristics o languages Ten thenormalized tokens are compared to reveal clone pairs insource codes Tey showed relatively good perormance andused structural inormation to some degree but it does notre1047298ect the structural inormation o source codes into itssimilarity measure ully

Te proposed method in this paper extends the kernel-based method proposed by Son et al [983093] Tey compared thestructure o source codes using a kernel unction directlyTey used the parse tree kernel especially [983097] a kind

o -convolution kernels [983095] to compare tree structureo source codes Compared to this work the proposedmethod incorporates unction-call inormation additionallyTe unctional calls are one o the important structuralinormation in comparing source codes Te main problemo Son et al is that they ocused only on syntactic structural

inormation that is local and static On the other hand theunction-call inormation provides a global view on sourcecode execution Tereore the plagiarized pairs o sourcecodes are detected more accurately by considering not only the syntactic structure but also the unction-call inormation

3 Program Source Code Plagiarism Detection

Plagiarism detection or program source codes also knownas programming plagiarism detection aims to detect plagia-rized source code pairs among a set o source codes Tesource code plagiarism detection normally consists o three

steps as illustrated in Figure 983089 Te 1047297rst step is a preprocessingstep that extracts eatures such as tokens and parse trees romsource codes Te second step calculates pair-wise similarity with the extracted eatures and a similarity measure Tere-ore the similarity values among all pairs are recorded into asimilarity matrix Finally the groups o source codes that aremost likely to be plagiarized are selected according to theirsimilarity values

Formally let be a set o source codes Plagiarismdetection aims to generate a list o plagiarized source codesbased on a similarity sim() between isin and isin I the similarity o a pair is higher than a prede1047297ned threshold

the pair is determined as a plagiarized one Tereore or

a source code isin a set o plagiarized source codes isde1047297ned as

= isin | sim ge = 1048701 (983089)

Te similarity measure sim( ) is decided by inormationtype extracted rom source codes A source code has twokinds o inormation lexical and structural inormation Lex-ical inormation corresponds to variables and the reserved

words like public i and or Tis vocabulary is composed o a large set o rarely occurring user-de1047297ned words (variables)and a small set o requently occurring words (reservedwords) On the other hand structural inormation corre-sponds to a structure that is determined by reserved wordsAmong them structural inormation is a more importantclue or detecting plagiarism since tokens can be easily con-

verted into other tokens without understanding the subjecto a source code Tereore this paper ocuses on structuralinormation Note that a source program has two kinds o structural inormation One is syntactic structure that isusually expressed as a parse tree and the other is unction-call graph structure

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 413

983092 Mathematical Problems in Engineering

Sourcecodes

Preprocessor

Similaritymatrix

Plagiarismdetectionmodule

Plagiarizedsourcecodespair

Similarity measure

Parse trees

Function call graphs

okens

Dependency graphs

F983145983143983157983154983141 983089 Te overall process o program source code plagiarism detection

expressionexpression

CompilationUnit

typeDeclaration ltEOFgt

classOrInterfaceModi1047297er

public

classDeclaration

class Fibo classBody

classBodyDeclaration

modi1047297er

classOrInterfaceModi1047297er

public

memberDeclaration

methodDeclaration

type

primitiveype

int

formalParameters

(

formalParameter

type

methodBody

modi1047297er

classOrInterfaceModi1047297er

static rFib onacci

formalParameter

number

primitiveype

int

block

blockStatement

statement

blockStatement

if parExpression statement

expression

||

( )

expression

expression expression

primary

number

primary

literal

1

expression

==expression expression

primary

number

primary

literal

2

blockStatement

statement

return statement

block

blockStatement

statement

expressionreturn

primary

literal

1

expression

(expression )

primary

rFibonacci

expressionList

-expression

primary

number

primary

literal

1

expression

(expression )

primary

rFibonacci

expressionList

-expression

primary

number

primary

literal

2

983223 983223 983223 983223 983223 983223

formalParameterList )

==

+

F983145983143983157983154983141 983090 A parse tree extracted rom the source code in Box 983089

4 Similarity Measure for Source CodesBased on Parse Tree Kernel

983092983089 Source Code as a ree Te program source code canbe naturally represented as a parse tree o which each nodedenotes variables reserved words operators and so onFigure 983090 shows an example parse tree extracted rom a Javacode in Box 983089 (this parse tree is slightly different rom theparse tree used in Son et al [983093] Tis is because a more recent

version o Java grammar is used in this paper) Te Java codein Box 983089 implements the Fibonacci sequence Due to the lack o width o paper only one unction rFibonacci is shown

in Figure 983090 while there exist 1047297ve unctions in Box 983089 Asshownin this algorithm a parse tree rom a simple source code canbe very large and deep-rooted

In this paper we use ANLR (another tool or languagerecognition) (httpwwwantlrorg) to extract a parse treerom a source code ANLR proposed by Parr and Quong isa language tool that provides a ramework or constructingrecognizers interpreters compilers and translators romgrammatical descriptions [983090983097] With ANLR and a languagegrammar a tree parser that translates a source code into aparse tree can be easily constructed

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 513

Mathematical Problems in Engineering 983093

public class Fibo public static int rFibonacci(int number)

if (number == 983089 number == 983090) return 983089

return rFibonacci(number-983089) + rFibonacci(number-983090)

private static int sum(int value983089 int value983090) return value983089 + value983090

public static int iFibonacci(int number) if (number == 983089 number == 983090)

return 983089int 1047297bo983089 = 983089 1047297bo983090 = 983089int 1047297bonacci = initOne()for (int i = 983091 i lt= number i++)

1047297bonacci = sum(1047297bo983089 1047297bo983090)1047297bo983089 = 1047297bo9830901047297bo983090 = 1047297bonacci

return 1047297bonacciprivate static int initOne()

return 983089public static void main(String[] args)

int rFibo = FiborFibonacci(983095)int iFibo = FiboiFibonacci(983095)Systemout println(rFibo)Systemout println(iFibo)

B983151983160 983089 An example o Java source code

Since parse tree has syntactic structural inormation ametric or parse tree that re1047298ects entire structural inorma-tion is required Te parse tree kernel is one o such metricsIt compares parse trees without manually designed structuraleatures

983092983090 Parse ree Kernel Parse tree kernel is a kernel that is

designed to compare tree structures such as parse trees o natural language sentences Tis kernel maps a parse treeonto a space spanned by all subtrees that can appear possibly in the parse tree Te explicit enumeration o all subtreesis computationally ineasible since the number o subtreesincreases exponentially as the size o tree grows Collins andDuffy proposed a method to compute the inner product o two trees without having to enumerate all subtrees [983097]

Let subtree1 subtree2 be all o the subtrees in a parsetree Ten can be represented as a vector

= ⟨subtree1 () subtree2 () subtree1038389 ()⟩ (983090)

where subtree1103925() is the requency o subtree1103925 in the parsetree Te kernel unction between two parse trees 1 and2 is de1047297ned as tree(1 2) = 12 and is determined as

tree 9830801 2983081 = 12= sum1103925 subtree1103925 9830801983081 sdot subtree1103925 9830802983081= sum

1103925

1048616 sum9073171isin1

subtree1038389983080110392519830811048617

sdot 1048616 sum9073172isin2

subtree1038389983080110392529830811048617

= sum9073171isin1

sum9073172isin2

98308011039251 11039252983081

(983091)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 613

983094 Mathematical Problems in Engineering

where 1and 2

are all the nodes in trees 1 and 2 Teindicator unction subtree1038389

(1103925) is 983089 i subtree1103925 is rooted at node1103925 and 983088 otherwise (11039251 11039252) is a unction which is de1047297ned as

98308011039251 11039252983081 = sum1103925

subtree103838998308011039251983081 sdot subtree1038389

98308011039252983081 (983092)

Tis unction can be calculated in polynomial time using theollowing recursive de1047297nition

(i) I the productions at 11039251 and 11039252 are different

98308011039251 11039252983081 = 0 (983093)

(ii) I both 11039251 and 11039252 are preterminals

98308011039251 11039252983081 = 1 (983094)

(iii) Otherwise the unction can be de1047297ned as ollow

98308011039251

11039252

983081 = nc(9073171)

prod1103925 9830801 + 983080ch1103925

98308011039251

983081 ch1103925

98308011039252

983081983081983081 (983095)

where nc(11039251) is the number o children o node 11039251 inthe tree

Since the productions at 11039251 and 11039252 are the same nc(11039251)is also equal to nc(11039252) Here ch1103925(11039251) denotes the th childnode o 11039251 Tis recursive algorithm is based on the act thatall subtrees rooted at a certain node can be constructed by combining the subtrees rooted at each o its children

983092983091 Modi1047297ed Parse ree Kernel Te parse tree kernel hasshown good perormance or parse trees o natural language

but it does not work well or program source code compari-son due to two issues Te 1047297rst issue is asymmetric in1047298uenceo node changes Te parse tree rom a source code tends tobe much larger and deeper than that rom a natural languagesentence Tereore the changes near root node have beenre1047298ected more ofen than the changes near lea nodes Tesecond issue is the sequence o subtrees Te original parsetree kernel counts the sequence o subtrees by consideringtheir order However the order o two substructures in asource code is meaningless in programming languages

Son et al proposed a modi1047297ed parse tree kernel to copewith these issues [983093] In order to solve the 1047297rst issue they introduced a decay actor tree and a threshold Δ that controlthe effect o large subtrees Te decay actor scales the relativeimportance o subtrees by their size As the depth o a subtreeincreases the kernel value o the subtree is penalized by (tree)size where size is the depth o the subtree In additionthe limitation o the maximum depth o possible subtrees isset as Δ so that the effect o large subtrees could be reducedTe second issue is solved by changing unction in (983095) toignore the order o two nodes

With a decay actor tree and a threshold Δ the recursiverules o the parse tree kernel is modi1047297ed as ollows

(i) I 11039251 and 11039252 are different

98308011039251

11039252

983081 = 0 (983096)

(ii) I both 11039251 and 11039252 are terminals or the current depth isequal to Δ

98308011039251 11039252983081 = tree (983097)

Equation (983095) cannot be used with these new recursive rules

since the number o child nodes can be different in 11039251 and 11039252Tus we adopt the maximum similarity between child nodesAs a result the unction in (983095) becomes

98308011039251 11039252983081 = tree

nc(9073171)prod1103925

8520081 + maxchisinch11039252

983080ch1103925 98308011039251983081 ch983081852009 (983089983088)

where ch9073172is a set o child nodes o 11039252

Te parse tree kernel with the modi1047297ed unctionmptdoes not satisy Mercerrsquos condition However many unctionsthat do not satisy Mercerrsquos condition [983091983088 983091983089] work wellin computing similarity [983091983090] Finally this parse tree kernelis used as the similarity measure sim

( ) in (983089) or syntactic

structural comparison o source codes

5 Similarity Measure for Source CodesBased on Graph Kernel

983093983089 Source Code as a Graph Recently program sourcecodes are written with object-oriented concepts and severalreactoring techniques so that the codes are getting more andmore modularized at unctional level Since a source codeencodes program logic to solve a problem the execution 1047298ow at unction level is one o the important actors to identiy the source code Tereore this unction-level 1047298ow should beconsidered to compare source codes

One possible representation or the unction-level 1047298ow isa unction-call graph which represents dependencies amongunctions within a program Let be a source code Ten a

unction calls graph = ( ) is a directed graph extractedrom where V isin is a unction in Tus || is the numbero unctions in is a set o edges and each edge represents adependency relation between unctions Tat is an edge 1103925 isin that connects nodesV 1103925 and V implies that a unction V 1103925 callsanother unction V Te weight o an edge is given by

1103925 = 9831631 i V 1103925 calls V 0 otherwise (983089983089)

Figure 983091 illustrates an example unction-call graphextracted rom the Java code in Box 983089 Tis code contains 1047297veunctions including main First main calls rFibonacciand iFibonacci in order Since rFibonacci is arecursive unction it calls itsel iFibonacci calls twounctions initOne and sum to initialize a variable and geta sum Finally main calls println to print out the results

A rule-based approach is adopted to extract a call graphrom a source code Simple rules are used to 1047297nd thecaller-callee relationship rom a parse tree For instance inJava a rule ldquoif lsquoexpression (expresionList)rsquo is oundthen expression is a called unction namerdquo is used to1047297nd subtrees rom a parse tree Ten unction names and

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 713

Mathematical Problems in Engineering 983095

rFibonacci

iFibonacci

main

initOne

sum

println

F983145983143983157983154983141 983091 Example o extracted call graph rom Java code

parameters are extracted rom the matched subtrees Tenodes or the extracted unction names are connected to thecaller node

Function-call graph o a program depicts how the pro-gram executes at unction level and how unctions are relatedto one another Since the 1047298ow o a program is quite uniqueaccording to the task the similarity between two sources canbe calculated using the 1047298ows o the programs Since this 1047298ow is represented as a unction-call graph the graph kernel is thebest method to compare unction-call graphs It showed goodperormance in several 1047297elds including biology and social

network analysis

983093983090 Graph Kernel Graph kernel is a kernel thatis designed tocompare graph structures Like the parse tree kernel a graphis mapped onto a eature space spanned by their subgraphs inthe graph kernel Te intuitive eature o the graph kernel isgraph isomorphism that determines the topological identityAccording to Gartner et al [983089983088] however it is as hard asdeciding whether two graphs are isomorphic to compute any complete graph kernel with an injective mapping unctionor all graphs where graph isomorphism is a NP-completeproblem [983091983091] Tus most graph kernels ocus on alternativeeature representation o graphs

Te random walk graph kernel is one o the most widely used graph kernels It uses all possible random walks aseatures or graphs Let be a set o all possible random walksandW907317() denotes the set o all possible walks with 1103925 edgesin graph For each random walk isin whose length is 1103925the corresponding eature mapping unction o a graph isgiven as

Φ () = radic907317graph1048699 isinW907317 () forall 9830801103925983081 = 9830801103925983081983165 (983089983090)

where 907317graph is a weight or the length 1103925 and (1103925) and (1103925)are the

th label o the random walk

and

respectively

Te kernel unction between two graphs 1 and 2 denotedby graph(1 2) can be de1047297ned as

graph 9830801 2983081 = sum

Φ 9830801983081 sdotΦ 9830802983081 (983089983091)

Gartner et al proposed an approach to calculate all ran-domwalks within two graphswithout explicit enumeration o all random walks [983089983088] A direct product graph o two graphs1 = (1 1) and 2 = (2 2) denoted by 1 times 2 =(times times) where times is its node set and times is its edge set isde1047297ned as ollows

times 9830801 times 2983081 = 1048699983080V 1 1983081 isin 1 times 2 | 983080V 1983081 = 9830801983081983165 times 9830801 times 2983081 = 1048699983080983080V 1 1983081 983080V 2 2983081983081 isin times 9830801 times 2983081 |

983080V 1 V 2983081 isin 1 9830801 2983081 isin 2

983080V 1

V 2

983081 = 9830801

2

983081983165

(983089983092)

where (V ) is the label o a node V and ( ) is the label o an edge between node and node Based on the directproduct graph the random walk kernel can be calculated

Let 1038389times isin R|times|times|times| denote an adjacency matrix o the directproduct1 times2 With a weighting actorgraph ge 0graph in(983089983091) can be rewritten as

graph 9830801 2983081 = |times|sum1103925=1

1048667infinsum907317=0

907317graph1038389907317times1048669

1103925

(983089983093)

Tis random walk kernel can be computed in (11039253

) usingSylvester equation or Conjugate Gradient method where 1103925 isthe number o nodes [983091983092]

983093983091 Modi1047297ed Graph Kernel When a graph kernel is used tocompare source codes goodperormance is not expected dueto theact that the graph kernel measures similaritiesbetweenwalks with an identical label Since the labels (unctionnames) o nodes within the unction-call graph are decidedby human developers they are seldom identical even i thesource codes are simple Tereore the graph kernel has toconsider nonidentical labels

Borgwardt et al modi1047297ed the random walk kernel to

compare nonidentical labels by changing the direct productgraph to include all pairs o nodes and edges [983091983093] Assumethat nodes are compared by a node kernelnode and edges arecompared by an edge kernel kedge edge Tat is node(V )calculates the similarity between two labels rom the nodes V

and andedge((V 1103925 V 1103925+1)(1103925 1103925+1)) computes the similarity between two edges (V 1103925 V 1103925+1) and (1103925 1103925+1) With these twokernels the random walk kernel between two unction callgraphs 1 and 2 is now de1047297ned as

mg 9830801 2983081 = 907317minus1sum1103925=1

step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081 (983089983094)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 813

983096 Mathematical Problems in Engineering

where

step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081= node 983080V 1103925 1103925983081 sdot node 983080V 1103925+1 1103925+1983081

sdot edge

983080983080V 1103925

V 1103925+1

983081 9830801103925

1103925+1

983081983081

(983089983095)

I this modi1047297ed random walk kernel is used or thecomparison o source code the node kernel node and theedge kernel edge should be de1047297ned Note that the labelso edges in unction-call graph are binary values by (983089983089)Tus edge is simply designed to compare binary valuesTe simplest orm or node(V ) is a unction that returns983089 when V and have similar string patterns 983088 otherwiseTat is it returns 983089 i a distance between V and is smallerthan a prede1047297ned threshold In this paper we simply useLevenshtein distance as the distance and set the threshold as983088983093

Te modi1047297ed random walk kernel mg can be alsocomputed using (983089983093) However the adjacency matrix

1038389times o

the direct product 1 times 2 should be modi1047297ed as

9831311038389times983133(V 10383891038389)(V 907317907317)

= 852091step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081

i 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081 isin times0 otherwise(983089983096)

wheretimes isashortormo times(1times2) and the edges (V 1103925 V 1103925+1)and (1103925 1103925+1) belong to 1 and 2 respectively As in parsetree kernel this modi1047297ed graph kernel is used to comparesource codes as similarity measure sim( ) in (983089)

6 Similarity Measure for Source CodesBased on a Composite Kernel

Te modi1047297ed parse tree kernel manages syntactic structuralinormation whereas the modi1047297ed graph kernel considershigh-level topological inormation o source codes In orderto make use o both kinds o inormation the compositiono the two kernels is required Cristianini and Shawe-aylorproved that a new kernel can be obtained by combiningexisting several kernels with some closure properties suchas weighted sum and multiplication [983091983094] Among variousclosure properties this paper adopts the weighted sum sinceit is simple and widely used

Beore combining two kernels the kernels should benormalized since the modi1047297ed parse tree kernel mpt andthe modi1047297ed graph kernelmg are not bound Tereore onekernel can dominate other in their composition In order toremove this effect the kernels are 1047297rst normalized When akernel() is given itsnormalized one()isde1047297nedas

= radic ( ) sdot 983080 983081 (983089983097)

Tereore () is bounded between 983088 and 983089 Tat is 0 le

(

) le 1

Our composite kernelco is composed o the normalizedmodi1047297ed parse tree kernel

mpt and the normalized modi1047297ed

graph kernelmg Tatis the composite kernel

co or given

two source codes and is de1047297ned as

co

= 9830801 minus 983081 sdot mpt

983080

983081 + sdot mg

983080

983081 (983090983088)

where is a mixing weight between two kernels and areparse trees extracted rom source codes and respectivelyand and are call graphs rom and respectively Telarger is the more signi1047297cant mpt is On the other handas the value o gets small the graph kernel mg is moresigni1047297cant than the parse tree kernel mpt Tis compositekernel is used as our 1047297nal similarity measure sim( ) in (983089)

Te parse tree kernel compares source codes with local-level view since it is based on subtree comparison Most pla-giarized source codes change a small portion o the originalsource code Tus the parse tree kernel has shown goodperormances in general However it does not re1047298ect the 1047298ow o the program which is dynamic structural inormationTegraph kernel on the other hand calculates the similarity interms o dynamic high-level view Tus when source codesconsist o a number o unctions the graph kernel achievesreasonable perormance As a result should be determinedby the complexity o source codes since it is a parameter tocontrol the relative importance between the parse tree kerneland the graph kernel

Tere are many methodsthat measure the complexity o asource code One widely-used method is the cyclomatic com-plexity proposed by McCabe [983091983095] Te cyclomatic complexity is a graph-theoretic quantitative metric and measures thenumber o paths within a source code It is simply calculated

using a control 1047298ow graph o a source code where the nodeso the graph correspond to entities o the source code andan (directed) edge between two nodes implies a dependency relation between entities Given the control 1047298ow graph o asource code () the cyclomatic complexity o source code is de1047297ned as

() = minus + 2 (983090983089)

where is the number o edges o the graph is the numbero nodes and is the number o connected components Telarger is the more complicated source code is

In this paper we measure the complexity o a sourcecode using itsunction-call graph Since a unction-call graphrepresents dependencies among unctions within a programit can be considered as a kind o control 1047298ow graphs whereentities o the source code are the unction in the source codeand the edge implies dependencies between unctions

Let() and() be the cyclomatic complexities o twosource codes and respectively Since is the weight o two normalized kernels it has to normalize between 983088 and 983089Te sigmoid unction is de1047297ned or all real input values andreturns a positive value between 983088 and 983089 Tus the sigmoidunction is adopted or o (983090983088) and is de1047297ned as

= 1

1 + minus(min(()())minus25)

(983090983090)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 913

Mathematical Problems in Engineering 983097

983137983138983148983141 983089 Simple statistics on the real data set

Inormation Value

Number o total assignments 983091983094

Number o submitted source codes 983093983093983093

Average number o submitted codes per assignment 983089983093983092983090

Minimum number o lines in source code 983092983097Maximum number o lines in source code 983090983096983094983091

Average number o lines per source code 983091983088983093983088983095

Minimum number o nodes in source code 983089983090

Maximum number o nodes in source code 983092983092983095

Average number o nodes in source code 983094983092983090983097

Number o marked plagiarism pairs 983089983095983093

where min() returns the minimum value between and According to (983090983090) as the cyclomatic complexity gets larger also increases is to be 983088983093 when thecyclomatic complexity o source code is 983090983093 Tis indicates

that when the cyclomatic complexity o source code is983090983093 the parse tree kernel and the graph kernel have anequal importance in the composite kernel A number o source code analysis applications regard source codes whosecyclomatic complexity is more than 983090983093 as complicated codes(httpmsdnmicrosofcomen-uslibraryms983089983096983090983090983089983090aspx )Tus we set 983090983093 as the equal point o the importance betweenthe parse tree kernel and the graph kernel

7 Experiments

983095983089 Experimental Settings For experiments the same dataset in the work o Son et al [983093] is used Tis data setis collected rom actual programming assignments o Javaclasses submitted by undergraduate students rom 983090983088983088983093 to983090983088983088983097 able 983089 shows simple statistics o the data set Te totalnumber o programming assignments is 983091983094 and the numbero submitted source codes is 983093983093983093 or the 983091983094 assignments Tusthe average number o source codes per an assignment is983089983093983092983090

Figure 983092 shows the histogram o the source codes perlines Te -axis is the number o program lines and the -axis represents the number o source codes As shown in this1047297gure about 983095983093 o source codes are written with less than983092983088983088 lines Te minimum number o lines o a source code is983092983097 and the maximum is 983090983096983094983091 Te average number o lines

per code is 983091983088983093983088983095In our data set the minimum number o unctions within

a program is 983089983090 whereas the maximum number is 983092983092983095Te programs with larger number o programs are paintprograms with any buttons In the paint programs studentsare required to set a layout manually with raw unctionssuch assetBounds Tus paint programs have a number o unctions Te average number o unctions is 983094983092983090983095

wo annotators created the gold standard or this dataset Tey investigated all sourcecodes and markedplagiarizedpairs manually In order to measurethe reliability andvalidity o the annotators Cohenrsquos kappa agreement [983091983096] is measuredTe kappa agreement o the annotators is

= 093 whichalls

180

160

140

120

100

80

60

40

20

00 500 1000 1500 2000 2500 3000

Source code lines

N u m

b e r o f s

o u r c e c o

d e s

F983145983143983157983154983141 983092 Histogram o source lines per code

on the category o ldquoalmost perect agreementrdquo Only the pairs judged as plagiarized pairs by both annotators are regardedas actual plagiarized pairs In total 983089983095983093 pairs are marked asplagiarized pairs

Tree metrics are used as evaluation measure precisionrecall and 1-measure Tey are calculated as ollows

Precision = o correctly detected plagiarized pairs

o detected plagiarized pairs

Recall = o correctly detected plagiarized pairs

o true plagiarized pairs

1-measure = 2 sdot precision

sdot recall

precision + recall (983090983091)

In order to evaluate the proposed method severalbaseline systems are used It is compared with JPlag andCCFinder In all experiments or the parse tree kernel thethreshold or subtree depth Δ is set as 983091 and the decay actortree is 983088983089 Te decay actor graph or graph kernel is set tobe 983088983089 empirically in (983090983089) is set to be 983089 because each sourcecode in our data set is a single program

983095983090 Experimental Results Beore evaluating the perormanceo plagiarism detection we 1047297rst examine relatedness between

the number o source code lines and the cyclomatic complex-ity Tis examination tries to show that (983090983090) is easible Since is determined with cyclomatic complexity it is expected to beproportionalto cyclomatic complexity Figure 983093 shows scatterplot between the number o lines and cyclomatic complexityAs shown in this 1047297gure they are highly correlated with eachother in our data set Te Pearson correlation coefficient is983088983095983089983092 Tis result implies that it is easible to set in (983090983090) to beproportional to cyclomatic complexity

In order to see the effect o threshold in (983089) in ourmethod the perormances are measured according to the

values o Figure 983094 shows the perormance o the pro-posed method or various

As

increases the precision

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1013

983089983088 Mathematical Problems in Engineering

600

500

400

300

200

100

0

C y c

l o m a t i c

c o m p l e x i t y

0 500 1000 1500 2000 2500 3000

Number of lines

F983145983143983157983154983141 983093 Scatter plot according to the number o lines o sourcecodes and the corresponding cyclomatic compelxity

10

09

08

07

06

05

086 088 092090 094 096 098

RecallPrecision

F -measure

Te value of threshold

F983145983143983157983154983141 983094 Perormance o the proposed system or real-world dataset

also increases while the recall decreases slightly Te bestperormance is achieved at = 096 with 983088983096983095 o 1-measureTus

= 096 is used at all the experiments below

Figure 983095 compares the proposed method with variouskernels according to the number o source code lines In this1047297gure the -axis is the number lines o source code andthe -axis represent the average 1-measure As shown inthis 1047297gure the original graph kernel shows the worst per-ormance Since it uses only the graph structure o unctioncalls it ofen ails in calculating the similarity among sourcecodes For example assume that there are two source codesIn one source code main calls a unction add and addcalls another unction multiply In the other source code main calls multiply and multiply calls add Tese twosource codes are the same under the graph kernel since thelabel inormation is ignored by the graph kernel Without

Graph kernel

Parse tree kernel

Modi1047297ed graph kernel

Proposed method

10

08

06

04

02

00

A v e r a g e F 1

- m e a s u r e

lt100 lt200 lt300 lt400 gt400

Number of lines

F983145983143983157983154983141 983095 Average 1-measure according to the number o source

code lines

the labels these two graphs are identical On the other handthe modi1047297ed graph kernel utilizes the label inormation As aresult it achieves better perormance than the graph kernel

Te parse tree kernel achieves higher perormance thanother methods or the source codes with less than 983091983088983088 linesWhen the number o lines in source codes is small theplagiarized codes are ofen made by changing the originalone locally Tus the parse tree kernel detects plagiarizedpairs accurately or the codes with small number o linesWhen source codes have more than 983091983088983088 lines the modi1047297edgraph kernel shows slightly better perormance than theparse tree kernel Tis result implies that high-level structuralinormation is another actor to compare (large) source codesand the modi1047297ed graph kernel can re1047298ect this structuralinormation well

Te proposed method that combines the parse tree kerneland the modi1047297ed graph kernel achieves the best perormanceor all source codes except those with 300 sim 400 linesSince the cyclomatic complexity o source codes with 300 sim400 lines is near 983090983093 the proposed method re1047298ects the parsetree kernel and the modi1047297ed graph kernel equally Tusit achieves an average perormance o the kernels By thecyclomatic complexity o source codes the proposed method

is more in1047298uenced by the parse tree kernel when a sourcecode is small I a source code is large the effect o graphkernel is larger than that o the parse tree kernel From theresults it can be concluded that the proposed method doesnot consider only local-level structural inormation but alsohigh-level structural inormation effectively

Te 1047297nal 1-measure o program source code plagiarismdetection is given in able 983090 Te proposed method showsthe best 1-measure compared to other kernels or opensource plagiarism systems Te difference o 1-measure is983088983090983097 against JPlag 983088983089983095 against CCFinder 983088983088983096 against themodi1047297ed graph and 983088983088983093 against the modi1047297ed parse treekernel Tis result implies that or source code plagiarism

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1113

Mathematical Problems in Engineering 983089983089

983137983138983148983141 983090 Final 1-measure o plagiarism detection

Method 1-measure

JPlag 983088983093983094

CCFinder 983088983095983088

Modi1047297ed parse tree kernel ( = 095) [983093] 983088983096983092

Graph kernel ( = 099) 983088983092983093Modi1047297ed graph kernel ( = 097) 983088983096983090

Proposed method ( = 096) 983088983096983095

detection the similarity measure sim in (983089) should considernot only the syntactic structural inormation but also thedynamic call structure simultaneously

8 Conclusion

In this paper we have proposed a novel method or programsource code comparison Te proposed method calculatesthe

similarity between two source codes with the composition o two kinds o structural inormation extractedrom the sourcecodes Tat is the method uses both syntactic inormationand dynamic inormation Te syntactic inormation whichprovides local-level structural view is included in the parsetree In order to compare the parse trees this paper adoptsa specialized tree kernel or parse trees o source codes Tedynamic inormation which is contained in the unction-callgraph gives high and global level structural view Te graphkernel with the consideration unction names is adopted tore1047298ect the graph structure Finally the proposed methoduses a composite kernel o the kernels to use both kindso inormation In addition the weights o the kernels in

the composite kernel are automatically determined with thecyclomatic complexity

In the experiments o Java program source code plagia-rism detection with real data set it is shown that the proposedmethod outperormed existing methods in detecting plagia-rized pairs In particular the experiments with the variousnumber o lines show that the proposed method alwaysworkswell regardless o the size o source codes

One advantage o the proposed method is that it canbe used with other languages such as C C++ and Pythoneven i the experiments were only conducted with JavaSince the proposed method requires only parse trees andunction-call graphs o source codes it can be applied to any

other languages i a parser or the languages is available Allkinds o inormation o the proposed method are available athttpmlknuackrplagiarism

Conflict of Interests

Te authors declare that there is no con1047298ict o interestsregarding the publication o this paper

Acknowledgments

Tis study was supported by the BK983090983089 Plus project (SWHuman Resource Development Program or Supporting

Smart Lie) unded by the Ministry o Education Schoolo Computer Science and Engineering Kyungpook NationalUniversity Korea (983090983089A983090983088983089983091983089983094983088983088983088983088983093) and by IC RampD pro-gram o MSIPIIP (983089983088983088983092983092983092983097983092 WiseKB Big data based sel-evolving knowledge base and reasoning platorm)

References

[983089] J-W Son and S-B Park ldquoWeb table discrimination with com-position o rich structural and content inormationrdquo Applied Sof Computing vol 983089983091 no 983089 pp 983092983095ndash983093983095 983090983088983089983091

[983090] DL McCabe ldquoCheating among college and university studentsa north American perspectiverdquo International Journal or Educa-tional Integrity vol 983089 no 983089 pp 983089ndash983089983089 983090983088983088983093

[983091] S Horwitz ldquoIdentiying the semantic and textual differencesbetween two versions o a programrdquo in Proceedings o the ACM SIGPLAN Conerence on Programming Language Design and Implementation pp 983090983091983092ndash983090983092983093 983089983097983097983088

[983092] W Yang ldquoIdentiying syntactic differences between two pro-

gramsrdquo Sofware Practice and Experience vol 983090983089 no 983095 pp 983095983091983097ndash983095983093983093 983089983097983097983089

[983093] J-W Son -G Noh H-J Song and S-B Park ldquoAn applicationor plagiarized source code detection based on a parse treekernelrdquo Engineering Applications o Arti1047297cial Intelligencevol983090983094no 983096 pp 983089983097983089983089ndash983089983097983089983096 983090983088983089983091

[983094] M L Kammer Plagiarism detection in haskell programs using call graph matching [MS thesis] Utrecht University 983090983088983089983089

[983095] D Haussler ldquoConvolution kernels on discrete structuresrdquo echRep UCS-CRL-983097983097-983089983088 University o Caliornia Santa CruzCali USA 983089983097983097983097

[983096] B Scholkop K suda and J-P Vert Kernel Methods inComputational Biology MI Press 983090983088983088983092

[983097] M Collins and N Duffy ldquoConvolution kernels or natural lan-guagerdquo in Advances in Neural Inormation Processing Systemspp 983094983090983093ndash983094983091983090 983090983088983088983089

[983089983088] Gartner P Flach and S Wrobel ldquoOn graph kernels hardnessresults and efficient alternativesrdquo in Proceedings o the 983089983094th Annual Conerence on Learning Teory pp 983089983090983097ndash983089983092983091 August983090983088983088983091

[983089983089] D Hindle ldquoNoun classi1047297cation rom predicate-argument struc-turesrdquo in Proceedings o the 983090983096th Annual Meeting on Association or Computational Linguistics (ACL rsquo983097983088) pp 983090983094983096ndash983090983095983093 Strouds-burg Pa USA June 983089983097983097983088

[983089983090] P Resnik ldquoUsing inormation content to evaluate semanticsimilarity in a taxonomyrdquo in Proceedings o the983089983091th International Joint Conerence on Arti1047297cial Intelligence pp 983092983092983096ndash983092983093983091 983089983097983097983093

[983089983091] B Gipp N Meuschke and C Breitinger ldquoCitation-based pla-giarism detection practicability on a large-scale scienti1047297c cor-pusrdquo Journalo the Association or Inormation Science and ech-nology vol 983094983093 no 983096 pp 983089983093983090983095ndash983089983093983092983088 983090983088983089983092

[983089983092] G Varelas E Voutsakis P Rafopoulou E G Petrakis andE E Milios ldquoSemantic similarity methods in wordnet andtheir application to inormation retrieval on the webrdquo in Pro-ceedings o the 983095th Annual ACM International Workshop on WebInormation and Data Management pp 983089983088ndash983089983094 983090983088983088983093

[983089983093] K Williams H-H Chen andC L GilesldquoClassiying and rank-ing search engine results as potential sources o plagiarismrdquo inProceedings o the ACM Symposium on Document Engineering pp 983097983095ndash983089983088983094 Fort Collins Colo USA September 983090983088983089983092

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1213

983089983090 Mathematical Problems in Engineering

[983089983094] R A Jarvis and E A Patrick ldquoClustering using a similarity measure based on shared near neighborsrdquo IEEE ransactions onComputers vol 983090983090 no 983089983089 pp 983089983088983090983093ndash983089983088983091983092 983089983097983095983091

[983089983095] K J Ottenstein ldquoAn algorithmic approach to the detection andprevention o plagiarismrdquo ACM SIGCSE Bulletin vol 983096 no 983092pp 983091983088ndash983092983089 983089983097983095983094

[983089983096] M Halstead Elements o Sofware Science Elsevier 983089983097983095983095[983089983097] I D Baxter A Yahin L Moura M SantrsquoAnna and L Bier

ldquoClone detection using abstract syntax treesrdquo in Proceedingso the IEEE International Conerence on Sofware Maintenance(ICSM rsquo983097983096) pp 983091983094983096ndash983091983095983095 November 983089983097983097983096

[983090983088] J Ferrante K J Ottenstein and J D Warren ldquoTe programdependence graph and its use in optimizationrdquo ACM ransac-tions on Programming Languages and Systems vol 983097 no 983091 pp983091983089983097ndash983091983092983097 983089983097983096983095

[983090983089] C Liu C Chen J Han and P S Yu ldquoGplag detection o sofware plagiarism by program dependence graph analysisrdquo inProceedings o the 983089983090th ACM SIGKDD International Conerenceon Knowledge Discovery and Data Mining pp 983096983095983090ndash983096983096983089 983090983088983088983094

[983090983090] H-I Lim H Park S Choi and Han ldquoA method or detecting

the thef o Java programs through analysis o the control 1047298ow inormationrdquo Inormation and Sofware echnology vol 983093983089 no983097 pp 983089983091983091983096ndash983089983091983093983088 983090983088983088983097

[983090983091] D-K Chae J Ha S-W Kim B J Kang and E G ImldquoSofware plagiarism detection a graph-based approachrdquo inProceedings o the 983090983090nd ACM International Conerence onInormation amp Knowledge Management (CIKM rsquo983089983091) pp 983089983093983095983095ndash983089983093983096983088 Burlingame Cali USA November 983090983088983089983091

[983090983092] E Stamatatos ldquoPlagiarism detection using stopword n-gramsrdquo Journal o the American Society or Inormation Science and echnology vol 983094983090 no 983089983090 pp 983090983093983089983090ndash983090983093983090983095 983090983088983089983089

[983090983093] G Cosma and M Joy ldquoAn approach to source-code plagiarismdetection and investigation using latent semantic analysisrdquo

IEEE ransactions on Computers vol 983094983089 no 983091 pp 983091983095983097ndash983091983097983092983090983088983089983090

[983090983094] L Prechelt G Malpohl and M Philippsen ldquoFinding plagia-risms among a set o programs with jplagrdquo Journal o Universal Computer Science vol 983096 no 983089983089 pp 983089983088983089983094ndash983089983088983091983096 983090983088983088983090

[983090983095] A Aiken ldquoMoss a system or detecting sofware plagiarismrdquo983089983097983097983096 httptheorystanordedusimaikenmoss

[983090983096] Kamiya S Kusumoto and K Inoue ldquoCCFinder a multilin-guistic token-based code clone detection system or large scalesource coderdquo IEEE ransactions on Sofware Engineering vol983090983096 no 983095 pp 983094983093983092ndash983094983095983088 983090983088983088983090

[983090983097] J Parr and R W Quong ldquoANLR a predicated-LL(k) parsergeneratorrdquo Sofware Practice and Experience vol 983090983093 no 983095 pp983095983096983097ndash983096983089983088 983089983097983097983093

[983091983088] V N Vapnik Te Nature o StatisticalLearningTeory SpringerNew York NY USA 983089983097983097983093

[983091983089] R Courant and D Hilbert Methods o Mathematical PhysicsInterscience New York NY USA 983089983097983093983091

[983091983090] A Moschitti and F M Zanzotto ldquoFast and effective kernelsor relational learning rom textsrdquo in Proceedings o the 983090983092thInternational Conerence on Machine Learning (ICML rsquo983088983095) pp983094983092983097ndash983094983093983094 Corvallis Ore USA June 983090983088983088983095

[983091983091] M R Garey and D S Johnson Computers and Intractability AGuide to the Teory o NP-Completeness W H Freeman 983089983097983097983088

[983091983092] S V N Vishwanathan N N Schraudolph R Kondor and KM Borgwardt ldquoGraph kernelsrdquo Journal o Machine Learning Research vol 983089983089 pp 983089983090983088983089ndash983089983090983092983090 983090983088983089983088

[983091983093] K M Borgwardt C S Ong S Schonauer S V N Vish-wanathan A J Smola and H-P Kriegel ldquoProtein unction pre-diction via graph kernelsrdquo Bioinormatics vol 983090983089 supplement 983089pp i983092983095ndashi983093983094 983090983088983088983093

[983091983094] N Cristianini and J Shawe-aylor An Introduction to Support Vector Machines Cambridge University Press CambridgeUK983090983088983088983088

[983091983095] J McCabe ldquoA complexity measurerdquo IEEE ransactions onSofware Engineering vol 983090 no 983092 pp 983091983088983096ndash983091983090983088 983089983097983095983094

[983091983096] J Carletta ldquoAssessing agreement on classi1047297cation tasks thekappa statisticrdquo Computational Linguistics vol 983090983090 no 983090 pp983090983092983097ndash983090983093983092 983089983097983097983094

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1313

Submit your manuscripts at

httpwwwhindawicom

Page 3: journal for scribd

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 313

Mathematical Problems in Engineering 983091

codes efficiently [983090983089] Kammer built a plagiarism detectiontool or Haskell language [983094] He extracted a call-graph 1047297rstrom a source code Te nodes in the graph are unctions andan edge indicates that one unction calls another unctionTen he transormed the graph into a tree to comparesource codes efficiently Finally he applied the

1038389lowast-based tree

edit distance and the tree isomorphism algorithm or thecomparison o source codes However this approach loses

much inormation lying on a graph since the graph is trans-ormed into a tree Lim et al proposed a method o detectingplagiarism o Java programs through analysis o 1047298ow pathso Java bytecodes [983090983090] Since a 1047298ow path o a program isa sequence o basic blocks upon execution o the programthey tried to align 1047298ow paths using a semiglobal alignmentalgorithm and then detected program source code plagiarismpairs Chae et al also tried to detect binary program(executable 1047297le) plagiarism [983090983091] Tey constructed 1047297rst A-CFG (API-labeled control 1047298ow graph) that is the unctionalabstraction o a program and then generated a vector o

a prede1047297ned dimension rom the A-CFG using MicrosofDevelopment Network Application Programming Interace(MSDN API) to avoid computational intractability Finallythey used a random walk (page-rank) algorithm to calcu-late the similarity between programs Unortunately thisapproach cannot be applied to other languages that do nothave MSDN API Recently some studies have used stopword1103925-grams [983090983092] and topic model [983090983093] to measure the similarity

Several program source code plagiarism detection toolsare available online Most o them use a string tokenizationand a string matching algorithm to measure the similarity between source codes Prechelt et al proposed JPlag It is asystem that can be used to detect plagiarism o source codeswritten in C C++ Java and Scheme [983090983094] It 1047297rst extractedtokens rom source codes and then compared tokens rom

each source code using Karp-Rabin Greedy String ilingalgorithm Another widely-used plagiarism detection systemis MOSS (MeasureO Sofware Similarity) proposed by Aiken[983090983095] Itis also based on a string-matching algorithmIt dividesprograms into 907317-grams where a 907317-gram is a contiguoussubstring o length 907317 Ten the similarity is determined by the number o same 907317-grams shared by the programs Oneo the state-o-the-art and well-known plagiarism detectionsystems is CCFinder proposed by Kamiya et al [983090983096] It usesboth attribute-counting-metric and structure inormation A

source code is transormed into a set o normalized tokensequences by its own transormation rules Te transor-mation rules are constructed manually or each languageto express structural characteristics o languages Ten thenormalized tokens are compared to reveal clone pairs insource codes Tey showed relatively good perormance andused structural inormation to some degree but it does notre1047298ect the structural inormation o source codes into itssimilarity measure ully

Te proposed method in this paper extends the kernel-based method proposed by Son et al [983093] Tey compared thestructure o source codes using a kernel unction directlyTey used the parse tree kernel especially [983097] a kind

o -convolution kernels [983095] to compare tree structureo source codes Compared to this work the proposedmethod incorporates unction-call inormation additionallyTe unctional calls are one o the important structuralinormation in comparing source codes Te main problemo Son et al is that they ocused only on syntactic structural

inormation that is local and static On the other hand theunction-call inormation provides a global view on sourcecode execution Tereore the plagiarized pairs o sourcecodes are detected more accurately by considering not only the syntactic structure but also the unction-call inormation

3 Program Source Code Plagiarism Detection

Plagiarism detection or program source codes also knownas programming plagiarism detection aims to detect plagia-rized source code pairs among a set o source codes Tesource code plagiarism detection normally consists o three

steps as illustrated in Figure 983089 Te 1047297rst step is a preprocessingstep that extracts eatures such as tokens and parse trees romsource codes Te second step calculates pair-wise similarity with the extracted eatures and a similarity measure Tere-ore the similarity values among all pairs are recorded into asimilarity matrix Finally the groups o source codes that aremost likely to be plagiarized are selected according to theirsimilarity values

Formally let be a set o source codes Plagiarismdetection aims to generate a list o plagiarized source codesbased on a similarity sim() between isin and isin I the similarity o a pair is higher than a prede1047297ned threshold

the pair is determined as a plagiarized one Tereore or

a source code isin a set o plagiarized source codes isde1047297ned as

= isin | sim ge = 1048701 (983089)

Te similarity measure sim( ) is decided by inormationtype extracted rom source codes A source code has twokinds o inormation lexical and structural inormation Lex-ical inormation corresponds to variables and the reserved

words like public i and or Tis vocabulary is composed o a large set o rarely occurring user-de1047297ned words (variables)and a small set o requently occurring words (reservedwords) On the other hand structural inormation corre-sponds to a structure that is determined by reserved wordsAmong them structural inormation is a more importantclue or detecting plagiarism since tokens can be easily con-

verted into other tokens without understanding the subjecto a source code Tereore this paper ocuses on structuralinormation Note that a source program has two kinds o structural inormation One is syntactic structure that isusually expressed as a parse tree and the other is unction-call graph structure

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 413

983092 Mathematical Problems in Engineering

Sourcecodes

Preprocessor

Similaritymatrix

Plagiarismdetectionmodule

Plagiarizedsourcecodespair

Similarity measure

Parse trees

Function call graphs

okens

Dependency graphs

F983145983143983157983154983141 983089 Te overall process o program source code plagiarism detection

expressionexpression

CompilationUnit

typeDeclaration ltEOFgt

classOrInterfaceModi1047297er

public

classDeclaration

class Fibo classBody

classBodyDeclaration

modi1047297er

classOrInterfaceModi1047297er

public

memberDeclaration

methodDeclaration

type

primitiveype

int

formalParameters

(

formalParameter

type

methodBody

modi1047297er

classOrInterfaceModi1047297er

static rFib onacci

formalParameter

number

primitiveype

int

block

blockStatement

statement

blockStatement

if parExpression statement

expression

||

( )

expression

expression expression

primary

number

primary

literal

1

expression

==expression expression

primary

number

primary

literal

2

blockStatement

statement

return statement

block

blockStatement

statement

expressionreturn

primary

literal

1

expression

(expression )

primary

rFibonacci

expressionList

-expression

primary

number

primary

literal

1

expression

(expression )

primary

rFibonacci

expressionList

-expression

primary

number

primary

literal

2

983223 983223 983223 983223 983223 983223

formalParameterList )

==

+

F983145983143983157983154983141 983090 A parse tree extracted rom the source code in Box 983089

4 Similarity Measure for Source CodesBased on Parse Tree Kernel

983092983089 Source Code as a ree Te program source code canbe naturally represented as a parse tree o which each nodedenotes variables reserved words operators and so onFigure 983090 shows an example parse tree extracted rom a Javacode in Box 983089 (this parse tree is slightly different rom theparse tree used in Son et al [983093] Tis is because a more recent

version o Java grammar is used in this paper) Te Java codein Box 983089 implements the Fibonacci sequence Due to the lack o width o paper only one unction rFibonacci is shown

in Figure 983090 while there exist 1047297ve unctions in Box 983089 Asshownin this algorithm a parse tree rom a simple source code canbe very large and deep-rooted

In this paper we use ANLR (another tool or languagerecognition) (httpwwwantlrorg) to extract a parse treerom a source code ANLR proposed by Parr and Quong isa language tool that provides a ramework or constructingrecognizers interpreters compilers and translators romgrammatical descriptions [983090983097] With ANLR and a languagegrammar a tree parser that translates a source code into aparse tree can be easily constructed

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 513

Mathematical Problems in Engineering 983093

public class Fibo public static int rFibonacci(int number)

if (number == 983089 number == 983090) return 983089

return rFibonacci(number-983089) + rFibonacci(number-983090)

private static int sum(int value983089 int value983090) return value983089 + value983090

public static int iFibonacci(int number) if (number == 983089 number == 983090)

return 983089int 1047297bo983089 = 983089 1047297bo983090 = 983089int 1047297bonacci = initOne()for (int i = 983091 i lt= number i++)

1047297bonacci = sum(1047297bo983089 1047297bo983090)1047297bo983089 = 1047297bo9830901047297bo983090 = 1047297bonacci

return 1047297bonacciprivate static int initOne()

return 983089public static void main(String[] args)

int rFibo = FiborFibonacci(983095)int iFibo = FiboiFibonacci(983095)Systemout println(rFibo)Systemout println(iFibo)

B983151983160 983089 An example o Java source code

Since parse tree has syntactic structural inormation ametric or parse tree that re1047298ects entire structural inorma-tion is required Te parse tree kernel is one o such metricsIt compares parse trees without manually designed structuraleatures

983092983090 Parse ree Kernel Parse tree kernel is a kernel that is

designed to compare tree structures such as parse trees o natural language sentences Tis kernel maps a parse treeonto a space spanned by all subtrees that can appear possibly in the parse tree Te explicit enumeration o all subtreesis computationally ineasible since the number o subtreesincreases exponentially as the size o tree grows Collins andDuffy proposed a method to compute the inner product o two trees without having to enumerate all subtrees [983097]

Let subtree1 subtree2 be all o the subtrees in a parsetree Ten can be represented as a vector

= ⟨subtree1 () subtree2 () subtree1038389 ()⟩ (983090)

where subtree1103925() is the requency o subtree1103925 in the parsetree Te kernel unction between two parse trees 1 and2 is de1047297ned as tree(1 2) = 12 and is determined as

tree 9830801 2983081 = 12= sum1103925 subtree1103925 9830801983081 sdot subtree1103925 9830802983081= sum

1103925

1048616 sum9073171isin1

subtree1038389983080110392519830811048617

sdot 1048616 sum9073172isin2

subtree1038389983080110392529830811048617

= sum9073171isin1

sum9073172isin2

98308011039251 11039252983081

(983091)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 613

983094 Mathematical Problems in Engineering

where 1and 2

are all the nodes in trees 1 and 2 Teindicator unction subtree1038389

(1103925) is 983089 i subtree1103925 is rooted at node1103925 and 983088 otherwise (11039251 11039252) is a unction which is de1047297ned as

98308011039251 11039252983081 = sum1103925

subtree103838998308011039251983081 sdot subtree1038389

98308011039252983081 (983092)

Tis unction can be calculated in polynomial time using theollowing recursive de1047297nition

(i) I the productions at 11039251 and 11039252 are different

98308011039251 11039252983081 = 0 (983093)

(ii) I both 11039251 and 11039252 are preterminals

98308011039251 11039252983081 = 1 (983094)

(iii) Otherwise the unction can be de1047297ned as ollow

98308011039251

11039252

983081 = nc(9073171)

prod1103925 9830801 + 983080ch1103925

98308011039251

983081 ch1103925

98308011039252

983081983081983081 (983095)

where nc(11039251) is the number o children o node 11039251 inthe tree

Since the productions at 11039251 and 11039252 are the same nc(11039251)is also equal to nc(11039252) Here ch1103925(11039251) denotes the th childnode o 11039251 Tis recursive algorithm is based on the act thatall subtrees rooted at a certain node can be constructed by combining the subtrees rooted at each o its children

983092983091 Modi1047297ed Parse ree Kernel Te parse tree kernel hasshown good perormance or parse trees o natural language

but it does not work well or program source code compari-son due to two issues Te 1047297rst issue is asymmetric in1047298uenceo node changes Te parse tree rom a source code tends tobe much larger and deeper than that rom a natural languagesentence Tereore the changes near root node have beenre1047298ected more ofen than the changes near lea nodes Tesecond issue is the sequence o subtrees Te original parsetree kernel counts the sequence o subtrees by consideringtheir order However the order o two substructures in asource code is meaningless in programming languages

Son et al proposed a modi1047297ed parse tree kernel to copewith these issues [983093] In order to solve the 1047297rst issue they introduced a decay actor tree and a threshold Δ that controlthe effect o large subtrees Te decay actor scales the relativeimportance o subtrees by their size As the depth o a subtreeincreases the kernel value o the subtree is penalized by (tree)size where size is the depth o the subtree In additionthe limitation o the maximum depth o possible subtrees isset as Δ so that the effect o large subtrees could be reducedTe second issue is solved by changing unction in (983095) toignore the order o two nodes

With a decay actor tree and a threshold Δ the recursiverules o the parse tree kernel is modi1047297ed as ollows

(i) I 11039251 and 11039252 are different

98308011039251

11039252

983081 = 0 (983096)

(ii) I both 11039251 and 11039252 are terminals or the current depth isequal to Δ

98308011039251 11039252983081 = tree (983097)

Equation (983095) cannot be used with these new recursive rules

since the number o child nodes can be different in 11039251 and 11039252Tus we adopt the maximum similarity between child nodesAs a result the unction in (983095) becomes

98308011039251 11039252983081 = tree

nc(9073171)prod1103925

8520081 + maxchisinch11039252

983080ch1103925 98308011039251983081 ch983081852009 (983089983088)

where ch9073172is a set o child nodes o 11039252

Te parse tree kernel with the modi1047297ed unctionmptdoes not satisy Mercerrsquos condition However many unctionsthat do not satisy Mercerrsquos condition [983091983088 983091983089] work wellin computing similarity [983091983090] Finally this parse tree kernelis used as the similarity measure sim

( ) in (983089) or syntactic

structural comparison o source codes

5 Similarity Measure for Source CodesBased on Graph Kernel

983093983089 Source Code as a Graph Recently program sourcecodes are written with object-oriented concepts and severalreactoring techniques so that the codes are getting more andmore modularized at unctional level Since a source codeencodes program logic to solve a problem the execution 1047298ow at unction level is one o the important actors to identiy the source code Tereore this unction-level 1047298ow should beconsidered to compare source codes

One possible representation or the unction-level 1047298ow isa unction-call graph which represents dependencies amongunctions within a program Let be a source code Ten a

unction calls graph = ( ) is a directed graph extractedrom where V isin is a unction in Tus || is the numbero unctions in is a set o edges and each edge represents adependency relation between unctions Tat is an edge 1103925 isin that connects nodesV 1103925 and V implies that a unction V 1103925 callsanother unction V Te weight o an edge is given by

1103925 = 9831631 i V 1103925 calls V 0 otherwise (983089983089)

Figure 983091 illustrates an example unction-call graphextracted rom the Java code in Box 983089 Tis code contains 1047297veunctions including main First main calls rFibonacciand iFibonacci in order Since rFibonacci is arecursive unction it calls itsel iFibonacci calls twounctions initOne and sum to initialize a variable and geta sum Finally main calls println to print out the results

A rule-based approach is adopted to extract a call graphrom a source code Simple rules are used to 1047297nd thecaller-callee relationship rom a parse tree For instance inJava a rule ldquoif lsquoexpression (expresionList)rsquo is oundthen expression is a called unction namerdquo is used to1047297nd subtrees rom a parse tree Ten unction names and

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 713

Mathematical Problems in Engineering 983095

rFibonacci

iFibonacci

main

initOne

sum

println

F983145983143983157983154983141 983091 Example o extracted call graph rom Java code

parameters are extracted rom the matched subtrees Tenodes or the extracted unction names are connected to thecaller node

Function-call graph o a program depicts how the pro-gram executes at unction level and how unctions are relatedto one another Since the 1047298ow o a program is quite uniqueaccording to the task the similarity between two sources canbe calculated using the 1047298ows o the programs Since this 1047298ow is represented as a unction-call graph the graph kernel is thebest method to compare unction-call graphs It showed goodperormance in several 1047297elds including biology and social

network analysis

983093983090 Graph Kernel Graph kernel is a kernel thatis designed tocompare graph structures Like the parse tree kernel a graphis mapped onto a eature space spanned by their subgraphs inthe graph kernel Te intuitive eature o the graph kernel isgraph isomorphism that determines the topological identityAccording to Gartner et al [983089983088] however it is as hard asdeciding whether two graphs are isomorphic to compute any complete graph kernel with an injective mapping unctionor all graphs where graph isomorphism is a NP-completeproblem [983091983091] Tus most graph kernels ocus on alternativeeature representation o graphs

Te random walk graph kernel is one o the most widely used graph kernels It uses all possible random walks aseatures or graphs Let be a set o all possible random walksandW907317() denotes the set o all possible walks with 1103925 edgesin graph For each random walk isin whose length is 1103925the corresponding eature mapping unction o a graph isgiven as

Φ () = radic907317graph1048699 isinW907317 () forall 9830801103925983081 = 9830801103925983081983165 (983089983090)

where 907317graph is a weight or the length 1103925 and (1103925) and (1103925)are the

th label o the random walk

and

respectively

Te kernel unction between two graphs 1 and 2 denotedby graph(1 2) can be de1047297ned as

graph 9830801 2983081 = sum

Φ 9830801983081 sdotΦ 9830802983081 (983089983091)

Gartner et al proposed an approach to calculate all ran-domwalks within two graphswithout explicit enumeration o all random walks [983089983088] A direct product graph o two graphs1 = (1 1) and 2 = (2 2) denoted by 1 times 2 =(times times) where times is its node set and times is its edge set isde1047297ned as ollows

times 9830801 times 2983081 = 1048699983080V 1 1983081 isin 1 times 2 | 983080V 1983081 = 9830801983081983165 times 9830801 times 2983081 = 1048699983080983080V 1 1983081 983080V 2 2983081983081 isin times 9830801 times 2983081 |

983080V 1 V 2983081 isin 1 9830801 2983081 isin 2

983080V 1

V 2

983081 = 9830801

2

983081983165

(983089983092)

where (V ) is the label o a node V and ( ) is the label o an edge between node and node Based on the directproduct graph the random walk kernel can be calculated

Let 1038389times isin R|times|times|times| denote an adjacency matrix o the directproduct1 times2 With a weighting actorgraph ge 0graph in(983089983091) can be rewritten as

graph 9830801 2983081 = |times|sum1103925=1

1048667infinsum907317=0

907317graph1038389907317times1048669

1103925

(983089983093)

Tis random walk kernel can be computed in (11039253

) usingSylvester equation or Conjugate Gradient method where 1103925 isthe number o nodes [983091983092]

983093983091 Modi1047297ed Graph Kernel When a graph kernel is used tocompare source codes goodperormance is not expected dueto theact that the graph kernel measures similaritiesbetweenwalks with an identical label Since the labels (unctionnames) o nodes within the unction-call graph are decidedby human developers they are seldom identical even i thesource codes are simple Tereore the graph kernel has toconsider nonidentical labels

Borgwardt et al modi1047297ed the random walk kernel to

compare nonidentical labels by changing the direct productgraph to include all pairs o nodes and edges [983091983093] Assumethat nodes are compared by a node kernelnode and edges arecompared by an edge kernel kedge edge Tat is node(V )calculates the similarity between two labels rom the nodes V

and andedge((V 1103925 V 1103925+1)(1103925 1103925+1)) computes the similarity between two edges (V 1103925 V 1103925+1) and (1103925 1103925+1) With these twokernels the random walk kernel between two unction callgraphs 1 and 2 is now de1047297ned as

mg 9830801 2983081 = 907317minus1sum1103925=1

step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081 (983089983094)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 813

983096 Mathematical Problems in Engineering

where

step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081= node 983080V 1103925 1103925983081 sdot node 983080V 1103925+1 1103925+1983081

sdot edge

983080983080V 1103925

V 1103925+1

983081 9830801103925

1103925+1

983081983081

(983089983095)

I this modi1047297ed random walk kernel is used or thecomparison o source code the node kernel node and theedge kernel edge should be de1047297ned Note that the labelso edges in unction-call graph are binary values by (983089983089)Tus edge is simply designed to compare binary valuesTe simplest orm or node(V ) is a unction that returns983089 when V and have similar string patterns 983088 otherwiseTat is it returns 983089 i a distance between V and is smallerthan a prede1047297ned threshold In this paper we simply useLevenshtein distance as the distance and set the threshold as983088983093

Te modi1047297ed random walk kernel mg can be alsocomputed using (983089983093) However the adjacency matrix

1038389times o

the direct product 1 times 2 should be modi1047297ed as

9831311038389times983133(V 10383891038389)(V 907317907317)

= 852091step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081

i 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081 isin times0 otherwise(983089983096)

wheretimes isashortormo times(1times2) and the edges (V 1103925 V 1103925+1)and (1103925 1103925+1) belong to 1 and 2 respectively As in parsetree kernel this modi1047297ed graph kernel is used to comparesource codes as similarity measure sim( ) in (983089)

6 Similarity Measure for Source CodesBased on a Composite Kernel

Te modi1047297ed parse tree kernel manages syntactic structuralinormation whereas the modi1047297ed graph kernel considershigh-level topological inormation o source codes In orderto make use o both kinds o inormation the compositiono the two kernels is required Cristianini and Shawe-aylorproved that a new kernel can be obtained by combiningexisting several kernels with some closure properties suchas weighted sum and multiplication [983091983094] Among variousclosure properties this paper adopts the weighted sum sinceit is simple and widely used

Beore combining two kernels the kernels should benormalized since the modi1047297ed parse tree kernel mpt andthe modi1047297ed graph kernelmg are not bound Tereore onekernel can dominate other in their composition In order toremove this effect the kernels are 1047297rst normalized When akernel() is given itsnormalized one()isde1047297nedas

= radic ( ) sdot 983080 983081 (983089983097)

Tereore () is bounded between 983088 and 983089 Tat is 0 le

(

) le 1

Our composite kernelco is composed o the normalizedmodi1047297ed parse tree kernel

mpt and the normalized modi1047297ed

graph kernelmg Tatis the composite kernel

co or given

two source codes and is de1047297ned as

co

= 9830801 minus 983081 sdot mpt

983080

983081 + sdot mg

983080

983081 (983090983088)

where is a mixing weight between two kernels and areparse trees extracted rom source codes and respectivelyand and are call graphs rom and respectively Telarger is the more signi1047297cant mpt is On the other handas the value o gets small the graph kernel mg is moresigni1047297cant than the parse tree kernel mpt Tis compositekernel is used as our 1047297nal similarity measure sim( ) in (983089)

Te parse tree kernel compares source codes with local-level view since it is based on subtree comparison Most pla-giarized source codes change a small portion o the originalsource code Tus the parse tree kernel has shown goodperormances in general However it does not re1047298ect the 1047298ow o the program which is dynamic structural inormationTegraph kernel on the other hand calculates the similarity interms o dynamic high-level view Tus when source codesconsist o a number o unctions the graph kernel achievesreasonable perormance As a result should be determinedby the complexity o source codes since it is a parameter tocontrol the relative importance between the parse tree kerneland the graph kernel

Tere are many methodsthat measure the complexity o asource code One widely-used method is the cyclomatic com-plexity proposed by McCabe [983091983095] Te cyclomatic complexity is a graph-theoretic quantitative metric and measures thenumber o paths within a source code It is simply calculated

using a control 1047298ow graph o a source code where the nodeso the graph correspond to entities o the source code andan (directed) edge between two nodes implies a dependency relation between entities Given the control 1047298ow graph o asource code () the cyclomatic complexity o source code is de1047297ned as

() = minus + 2 (983090983089)

where is the number o edges o the graph is the numbero nodes and is the number o connected components Telarger is the more complicated source code is

In this paper we measure the complexity o a sourcecode using itsunction-call graph Since a unction-call graphrepresents dependencies among unctions within a programit can be considered as a kind o control 1047298ow graphs whereentities o the source code are the unction in the source codeand the edge implies dependencies between unctions

Let() and() be the cyclomatic complexities o twosource codes and respectively Since is the weight o two normalized kernels it has to normalize between 983088 and 983089Te sigmoid unction is de1047297ned or all real input values andreturns a positive value between 983088 and 983089 Tus the sigmoidunction is adopted or o (983090983088) and is de1047297ned as

= 1

1 + minus(min(()())minus25)

(983090983090)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 913

Mathematical Problems in Engineering 983097

983137983138983148983141 983089 Simple statistics on the real data set

Inormation Value

Number o total assignments 983091983094

Number o submitted source codes 983093983093983093

Average number o submitted codes per assignment 983089983093983092983090

Minimum number o lines in source code 983092983097Maximum number o lines in source code 983090983096983094983091

Average number o lines per source code 983091983088983093983088983095

Minimum number o nodes in source code 983089983090

Maximum number o nodes in source code 983092983092983095

Average number o nodes in source code 983094983092983090983097

Number o marked plagiarism pairs 983089983095983093

where min() returns the minimum value between and According to (983090983090) as the cyclomatic complexity gets larger also increases is to be 983088983093 when thecyclomatic complexity o source code is 983090983093 Tis indicates

that when the cyclomatic complexity o source code is983090983093 the parse tree kernel and the graph kernel have anequal importance in the composite kernel A number o source code analysis applications regard source codes whosecyclomatic complexity is more than 983090983093 as complicated codes(httpmsdnmicrosofcomen-uslibraryms983089983096983090983090983089983090aspx )Tus we set 983090983093 as the equal point o the importance betweenthe parse tree kernel and the graph kernel

7 Experiments

983095983089 Experimental Settings For experiments the same dataset in the work o Son et al [983093] is used Tis data setis collected rom actual programming assignments o Javaclasses submitted by undergraduate students rom 983090983088983088983093 to983090983088983088983097 able 983089 shows simple statistics o the data set Te totalnumber o programming assignments is 983091983094 and the numbero submitted source codes is 983093983093983093 or the 983091983094 assignments Tusthe average number o source codes per an assignment is983089983093983092983090

Figure 983092 shows the histogram o the source codes perlines Te -axis is the number o program lines and the -axis represents the number o source codes As shown in this1047297gure about 983095983093 o source codes are written with less than983092983088983088 lines Te minimum number o lines o a source code is983092983097 and the maximum is 983090983096983094983091 Te average number o lines

per code is 983091983088983093983088983095In our data set the minimum number o unctions within

a program is 983089983090 whereas the maximum number is 983092983092983095Te programs with larger number o programs are paintprograms with any buttons In the paint programs studentsare required to set a layout manually with raw unctionssuch assetBounds Tus paint programs have a number o unctions Te average number o unctions is 983094983092983090983095

wo annotators created the gold standard or this dataset Tey investigated all sourcecodes and markedplagiarizedpairs manually In order to measurethe reliability andvalidity o the annotators Cohenrsquos kappa agreement [983091983096] is measuredTe kappa agreement o the annotators is

= 093 whichalls

180

160

140

120

100

80

60

40

20

00 500 1000 1500 2000 2500 3000

Source code lines

N u m

b e r o f s

o u r c e c o

d e s

F983145983143983157983154983141 983092 Histogram o source lines per code

on the category o ldquoalmost perect agreementrdquo Only the pairs judged as plagiarized pairs by both annotators are regardedas actual plagiarized pairs In total 983089983095983093 pairs are marked asplagiarized pairs

Tree metrics are used as evaluation measure precisionrecall and 1-measure Tey are calculated as ollows

Precision = o correctly detected plagiarized pairs

o detected plagiarized pairs

Recall = o correctly detected plagiarized pairs

o true plagiarized pairs

1-measure = 2 sdot precision

sdot recall

precision + recall (983090983091)

In order to evaluate the proposed method severalbaseline systems are used It is compared with JPlag andCCFinder In all experiments or the parse tree kernel thethreshold or subtree depth Δ is set as 983091 and the decay actortree is 983088983089 Te decay actor graph or graph kernel is set tobe 983088983089 empirically in (983090983089) is set to be 983089 because each sourcecode in our data set is a single program

983095983090 Experimental Results Beore evaluating the perormanceo plagiarism detection we 1047297rst examine relatedness between

the number o source code lines and the cyclomatic complex-ity Tis examination tries to show that (983090983090) is easible Since is determined with cyclomatic complexity it is expected to beproportionalto cyclomatic complexity Figure 983093 shows scatterplot between the number o lines and cyclomatic complexityAs shown in this 1047297gure they are highly correlated with eachother in our data set Te Pearson correlation coefficient is983088983095983089983092 Tis result implies that it is easible to set in (983090983090) to beproportional to cyclomatic complexity

In order to see the effect o threshold in (983089) in ourmethod the perormances are measured according to the

values o Figure 983094 shows the perormance o the pro-posed method or various

As

increases the precision

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1013

983089983088 Mathematical Problems in Engineering

600

500

400

300

200

100

0

C y c

l o m a t i c

c o m p l e x i t y

0 500 1000 1500 2000 2500 3000

Number of lines

F983145983143983157983154983141 983093 Scatter plot according to the number o lines o sourcecodes and the corresponding cyclomatic compelxity

10

09

08

07

06

05

086 088 092090 094 096 098

RecallPrecision

F -measure

Te value of threshold

F983145983143983157983154983141 983094 Perormance o the proposed system or real-world dataset

also increases while the recall decreases slightly Te bestperormance is achieved at = 096 with 983088983096983095 o 1-measureTus

= 096 is used at all the experiments below

Figure 983095 compares the proposed method with variouskernels according to the number o source code lines In this1047297gure the -axis is the number lines o source code andthe -axis represent the average 1-measure As shown inthis 1047297gure the original graph kernel shows the worst per-ormance Since it uses only the graph structure o unctioncalls it ofen ails in calculating the similarity among sourcecodes For example assume that there are two source codesIn one source code main calls a unction add and addcalls another unction multiply In the other source code main calls multiply and multiply calls add Tese twosource codes are the same under the graph kernel since thelabel inormation is ignored by the graph kernel Without

Graph kernel

Parse tree kernel

Modi1047297ed graph kernel

Proposed method

10

08

06

04

02

00

A v e r a g e F 1

- m e a s u r e

lt100 lt200 lt300 lt400 gt400

Number of lines

F983145983143983157983154983141 983095 Average 1-measure according to the number o source

code lines

the labels these two graphs are identical On the other handthe modi1047297ed graph kernel utilizes the label inormation As aresult it achieves better perormance than the graph kernel

Te parse tree kernel achieves higher perormance thanother methods or the source codes with less than 983091983088983088 linesWhen the number o lines in source codes is small theplagiarized codes are ofen made by changing the originalone locally Tus the parse tree kernel detects plagiarizedpairs accurately or the codes with small number o linesWhen source codes have more than 983091983088983088 lines the modi1047297edgraph kernel shows slightly better perormance than theparse tree kernel Tis result implies that high-level structuralinormation is another actor to compare (large) source codesand the modi1047297ed graph kernel can re1047298ect this structuralinormation well

Te proposed method that combines the parse tree kerneland the modi1047297ed graph kernel achieves the best perormanceor all source codes except those with 300 sim 400 linesSince the cyclomatic complexity o source codes with 300 sim400 lines is near 983090983093 the proposed method re1047298ects the parsetree kernel and the modi1047297ed graph kernel equally Tusit achieves an average perormance o the kernels By thecyclomatic complexity o source codes the proposed method

is more in1047298uenced by the parse tree kernel when a sourcecode is small I a source code is large the effect o graphkernel is larger than that o the parse tree kernel From theresults it can be concluded that the proposed method doesnot consider only local-level structural inormation but alsohigh-level structural inormation effectively

Te 1047297nal 1-measure o program source code plagiarismdetection is given in able 983090 Te proposed method showsthe best 1-measure compared to other kernels or opensource plagiarism systems Te difference o 1-measure is983088983090983097 against JPlag 983088983089983095 against CCFinder 983088983088983096 against themodi1047297ed graph and 983088983088983093 against the modi1047297ed parse treekernel Tis result implies that or source code plagiarism

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1113

Mathematical Problems in Engineering 983089983089

983137983138983148983141 983090 Final 1-measure o plagiarism detection

Method 1-measure

JPlag 983088983093983094

CCFinder 983088983095983088

Modi1047297ed parse tree kernel ( = 095) [983093] 983088983096983092

Graph kernel ( = 099) 983088983092983093Modi1047297ed graph kernel ( = 097) 983088983096983090

Proposed method ( = 096) 983088983096983095

detection the similarity measure sim in (983089) should considernot only the syntactic structural inormation but also thedynamic call structure simultaneously

8 Conclusion

In this paper we have proposed a novel method or programsource code comparison Te proposed method calculatesthe

similarity between two source codes with the composition o two kinds o structural inormation extractedrom the sourcecodes Tat is the method uses both syntactic inormationand dynamic inormation Te syntactic inormation whichprovides local-level structural view is included in the parsetree In order to compare the parse trees this paper adoptsa specialized tree kernel or parse trees o source codes Tedynamic inormation which is contained in the unction-callgraph gives high and global level structural view Te graphkernel with the consideration unction names is adopted tore1047298ect the graph structure Finally the proposed methoduses a composite kernel o the kernels to use both kindso inormation In addition the weights o the kernels in

the composite kernel are automatically determined with thecyclomatic complexity

In the experiments o Java program source code plagia-rism detection with real data set it is shown that the proposedmethod outperormed existing methods in detecting plagia-rized pairs In particular the experiments with the variousnumber o lines show that the proposed method alwaysworkswell regardless o the size o source codes

One advantage o the proposed method is that it canbe used with other languages such as C C++ and Pythoneven i the experiments were only conducted with JavaSince the proposed method requires only parse trees andunction-call graphs o source codes it can be applied to any

other languages i a parser or the languages is available Allkinds o inormation o the proposed method are available athttpmlknuackrplagiarism

Conflict of Interests

Te authors declare that there is no con1047298ict o interestsregarding the publication o this paper

Acknowledgments

Tis study was supported by the BK983090983089 Plus project (SWHuman Resource Development Program or Supporting

Smart Lie) unded by the Ministry o Education Schoolo Computer Science and Engineering Kyungpook NationalUniversity Korea (983090983089A983090983088983089983091983089983094983088983088983088983088983093) and by IC RampD pro-gram o MSIPIIP (983089983088983088983092983092983092983097983092 WiseKB Big data based sel-evolving knowledge base and reasoning platorm)

References

[983089] J-W Son and S-B Park ldquoWeb table discrimination with com-position o rich structural and content inormationrdquo Applied Sof Computing vol 983089983091 no 983089 pp 983092983095ndash983093983095 983090983088983089983091

[983090] DL McCabe ldquoCheating among college and university studentsa north American perspectiverdquo International Journal or Educa-tional Integrity vol 983089 no 983089 pp 983089ndash983089983089 983090983088983088983093

[983091] S Horwitz ldquoIdentiying the semantic and textual differencesbetween two versions o a programrdquo in Proceedings o the ACM SIGPLAN Conerence on Programming Language Design and Implementation pp 983090983091983092ndash983090983092983093 983089983097983097983088

[983092] W Yang ldquoIdentiying syntactic differences between two pro-

gramsrdquo Sofware Practice and Experience vol 983090983089 no 983095 pp 983095983091983097ndash983095983093983093 983089983097983097983089

[983093] J-W Son -G Noh H-J Song and S-B Park ldquoAn applicationor plagiarized source code detection based on a parse treekernelrdquo Engineering Applications o Arti1047297cial Intelligencevol983090983094no 983096 pp 983089983097983089983089ndash983089983097983089983096 983090983088983089983091

[983094] M L Kammer Plagiarism detection in haskell programs using call graph matching [MS thesis] Utrecht University 983090983088983089983089

[983095] D Haussler ldquoConvolution kernels on discrete structuresrdquo echRep UCS-CRL-983097983097-983089983088 University o Caliornia Santa CruzCali USA 983089983097983097983097

[983096] B Scholkop K suda and J-P Vert Kernel Methods inComputational Biology MI Press 983090983088983088983092

[983097] M Collins and N Duffy ldquoConvolution kernels or natural lan-guagerdquo in Advances in Neural Inormation Processing Systemspp 983094983090983093ndash983094983091983090 983090983088983088983089

[983089983088] Gartner P Flach and S Wrobel ldquoOn graph kernels hardnessresults and efficient alternativesrdquo in Proceedings o the 983089983094th Annual Conerence on Learning Teory pp 983089983090983097ndash983089983092983091 August983090983088983088983091

[983089983089] D Hindle ldquoNoun classi1047297cation rom predicate-argument struc-turesrdquo in Proceedings o the 983090983096th Annual Meeting on Association or Computational Linguistics (ACL rsquo983097983088) pp 983090983094983096ndash983090983095983093 Strouds-burg Pa USA June 983089983097983097983088

[983089983090] P Resnik ldquoUsing inormation content to evaluate semanticsimilarity in a taxonomyrdquo in Proceedings o the983089983091th International Joint Conerence on Arti1047297cial Intelligence pp 983092983092983096ndash983092983093983091 983089983097983097983093

[983089983091] B Gipp N Meuschke and C Breitinger ldquoCitation-based pla-giarism detection practicability on a large-scale scienti1047297c cor-pusrdquo Journalo the Association or Inormation Science and ech-nology vol 983094983093 no 983096 pp 983089983093983090983095ndash983089983093983092983088 983090983088983089983092

[983089983092] G Varelas E Voutsakis P Rafopoulou E G Petrakis andE E Milios ldquoSemantic similarity methods in wordnet andtheir application to inormation retrieval on the webrdquo in Pro-ceedings o the 983095th Annual ACM International Workshop on WebInormation and Data Management pp 983089983088ndash983089983094 983090983088983088983093

[983089983093] K Williams H-H Chen andC L GilesldquoClassiying and rank-ing search engine results as potential sources o plagiarismrdquo inProceedings o the ACM Symposium on Document Engineering pp 983097983095ndash983089983088983094 Fort Collins Colo USA September 983090983088983089983092

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1213

983089983090 Mathematical Problems in Engineering

[983089983094] R A Jarvis and E A Patrick ldquoClustering using a similarity measure based on shared near neighborsrdquo IEEE ransactions onComputers vol 983090983090 no 983089983089 pp 983089983088983090983093ndash983089983088983091983092 983089983097983095983091

[983089983095] K J Ottenstein ldquoAn algorithmic approach to the detection andprevention o plagiarismrdquo ACM SIGCSE Bulletin vol 983096 no 983092pp 983091983088ndash983092983089 983089983097983095983094

[983089983096] M Halstead Elements o Sofware Science Elsevier 983089983097983095983095[983089983097] I D Baxter A Yahin L Moura M SantrsquoAnna and L Bier

ldquoClone detection using abstract syntax treesrdquo in Proceedingso the IEEE International Conerence on Sofware Maintenance(ICSM rsquo983097983096) pp 983091983094983096ndash983091983095983095 November 983089983097983097983096

[983090983088] J Ferrante K J Ottenstein and J D Warren ldquoTe programdependence graph and its use in optimizationrdquo ACM ransac-tions on Programming Languages and Systems vol 983097 no 983091 pp983091983089983097ndash983091983092983097 983089983097983096983095

[983090983089] C Liu C Chen J Han and P S Yu ldquoGplag detection o sofware plagiarism by program dependence graph analysisrdquo inProceedings o the 983089983090th ACM SIGKDD International Conerenceon Knowledge Discovery and Data Mining pp 983096983095983090ndash983096983096983089 983090983088983088983094

[983090983090] H-I Lim H Park S Choi and Han ldquoA method or detecting

the thef o Java programs through analysis o the control 1047298ow inormationrdquo Inormation and Sofware echnology vol 983093983089 no983097 pp 983089983091983091983096ndash983089983091983093983088 983090983088983088983097

[983090983091] D-K Chae J Ha S-W Kim B J Kang and E G ImldquoSofware plagiarism detection a graph-based approachrdquo inProceedings o the 983090983090nd ACM International Conerence onInormation amp Knowledge Management (CIKM rsquo983089983091) pp 983089983093983095983095ndash983089983093983096983088 Burlingame Cali USA November 983090983088983089983091

[983090983092] E Stamatatos ldquoPlagiarism detection using stopword n-gramsrdquo Journal o the American Society or Inormation Science and echnology vol 983094983090 no 983089983090 pp 983090983093983089983090ndash983090983093983090983095 983090983088983089983089

[983090983093] G Cosma and M Joy ldquoAn approach to source-code plagiarismdetection and investigation using latent semantic analysisrdquo

IEEE ransactions on Computers vol 983094983089 no 983091 pp 983091983095983097ndash983091983097983092983090983088983089983090

[983090983094] L Prechelt G Malpohl and M Philippsen ldquoFinding plagia-risms among a set o programs with jplagrdquo Journal o Universal Computer Science vol 983096 no 983089983089 pp 983089983088983089983094ndash983089983088983091983096 983090983088983088983090

[983090983095] A Aiken ldquoMoss a system or detecting sofware plagiarismrdquo983089983097983097983096 httptheorystanordedusimaikenmoss

[983090983096] Kamiya S Kusumoto and K Inoue ldquoCCFinder a multilin-guistic token-based code clone detection system or large scalesource coderdquo IEEE ransactions on Sofware Engineering vol983090983096 no 983095 pp 983094983093983092ndash983094983095983088 983090983088983088983090

[983090983097] J Parr and R W Quong ldquoANLR a predicated-LL(k) parsergeneratorrdquo Sofware Practice and Experience vol 983090983093 no 983095 pp983095983096983097ndash983096983089983088 983089983097983097983093

[983091983088] V N Vapnik Te Nature o StatisticalLearningTeory SpringerNew York NY USA 983089983097983097983093

[983091983089] R Courant and D Hilbert Methods o Mathematical PhysicsInterscience New York NY USA 983089983097983093983091

[983091983090] A Moschitti and F M Zanzotto ldquoFast and effective kernelsor relational learning rom textsrdquo in Proceedings o the 983090983092thInternational Conerence on Machine Learning (ICML rsquo983088983095) pp983094983092983097ndash983094983093983094 Corvallis Ore USA June 983090983088983088983095

[983091983091] M R Garey and D S Johnson Computers and Intractability AGuide to the Teory o NP-Completeness W H Freeman 983089983097983097983088

[983091983092] S V N Vishwanathan N N Schraudolph R Kondor and KM Borgwardt ldquoGraph kernelsrdquo Journal o Machine Learning Research vol 983089983089 pp 983089983090983088983089ndash983089983090983092983090 983090983088983089983088

[983091983093] K M Borgwardt C S Ong S Schonauer S V N Vish-wanathan A J Smola and H-P Kriegel ldquoProtein unction pre-diction via graph kernelsrdquo Bioinormatics vol 983090983089 supplement 983089pp i983092983095ndashi983093983094 983090983088983088983093

[983091983094] N Cristianini and J Shawe-aylor An Introduction to Support Vector Machines Cambridge University Press CambridgeUK983090983088983088983088

[983091983095] J McCabe ldquoA complexity measurerdquo IEEE ransactions onSofware Engineering vol 983090 no 983092 pp 983091983088983096ndash983091983090983088 983089983097983095983094

[983091983096] J Carletta ldquoAssessing agreement on classi1047297cation tasks thekappa statisticrdquo Computational Linguistics vol 983090983090 no 983090 pp983090983092983097ndash983090983093983092 983089983097983097983094

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1313

Submit your manuscripts at

httpwwwhindawicom

Page 4: journal for scribd

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 413

983092 Mathematical Problems in Engineering

Sourcecodes

Preprocessor

Similaritymatrix

Plagiarismdetectionmodule

Plagiarizedsourcecodespair

Similarity measure

Parse trees

Function call graphs

okens

Dependency graphs

F983145983143983157983154983141 983089 Te overall process o program source code plagiarism detection

expressionexpression

CompilationUnit

typeDeclaration ltEOFgt

classOrInterfaceModi1047297er

public

classDeclaration

class Fibo classBody

classBodyDeclaration

modi1047297er

classOrInterfaceModi1047297er

public

memberDeclaration

methodDeclaration

type

primitiveype

int

formalParameters

(

formalParameter

type

methodBody

modi1047297er

classOrInterfaceModi1047297er

static rFib onacci

formalParameter

number

primitiveype

int

block

blockStatement

statement

blockStatement

if parExpression statement

expression

||

( )

expression

expression expression

primary

number

primary

literal

1

expression

==expression expression

primary

number

primary

literal

2

blockStatement

statement

return statement

block

blockStatement

statement

expressionreturn

primary

literal

1

expression

(expression )

primary

rFibonacci

expressionList

-expression

primary

number

primary

literal

1

expression

(expression )

primary

rFibonacci

expressionList

-expression

primary

number

primary

literal

2

983223 983223 983223 983223 983223 983223

formalParameterList )

==

+

F983145983143983157983154983141 983090 A parse tree extracted rom the source code in Box 983089

4 Similarity Measure for Source CodesBased on Parse Tree Kernel

983092983089 Source Code as a ree Te program source code canbe naturally represented as a parse tree o which each nodedenotes variables reserved words operators and so onFigure 983090 shows an example parse tree extracted rom a Javacode in Box 983089 (this parse tree is slightly different rom theparse tree used in Son et al [983093] Tis is because a more recent

version o Java grammar is used in this paper) Te Java codein Box 983089 implements the Fibonacci sequence Due to the lack o width o paper only one unction rFibonacci is shown

in Figure 983090 while there exist 1047297ve unctions in Box 983089 Asshownin this algorithm a parse tree rom a simple source code canbe very large and deep-rooted

In this paper we use ANLR (another tool or languagerecognition) (httpwwwantlrorg) to extract a parse treerom a source code ANLR proposed by Parr and Quong isa language tool that provides a ramework or constructingrecognizers interpreters compilers and translators romgrammatical descriptions [983090983097] With ANLR and a languagegrammar a tree parser that translates a source code into aparse tree can be easily constructed

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 513

Mathematical Problems in Engineering 983093

public class Fibo public static int rFibonacci(int number)

if (number == 983089 number == 983090) return 983089

return rFibonacci(number-983089) + rFibonacci(number-983090)

private static int sum(int value983089 int value983090) return value983089 + value983090

public static int iFibonacci(int number) if (number == 983089 number == 983090)

return 983089int 1047297bo983089 = 983089 1047297bo983090 = 983089int 1047297bonacci = initOne()for (int i = 983091 i lt= number i++)

1047297bonacci = sum(1047297bo983089 1047297bo983090)1047297bo983089 = 1047297bo9830901047297bo983090 = 1047297bonacci

return 1047297bonacciprivate static int initOne()

return 983089public static void main(String[] args)

int rFibo = FiborFibonacci(983095)int iFibo = FiboiFibonacci(983095)Systemout println(rFibo)Systemout println(iFibo)

B983151983160 983089 An example o Java source code

Since parse tree has syntactic structural inormation ametric or parse tree that re1047298ects entire structural inorma-tion is required Te parse tree kernel is one o such metricsIt compares parse trees without manually designed structuraleatures

983092983090 Parse ree Kernel Parse tree kernel is a kernel that is

designed to compare tree structures such as parse trees o natural language sentences Tis kernel maps a parse treeonto a space spanned by all subtrees that can appear possibly in the parse tree Te explicit enumeration o all subtreesis computationally ineasible since the number o subtreesincreases exponentially as the size o tree grows Collins andDuffy proposed a method to compute the inner product o two trees without having to enumerate all subtrees [983097]

Let subtree1 subtree2 be all o the subtrees in a parsetree Ten can be represented as a vector

= ⟨subtree1 () subtree2 () subtree1038389 ()⟩ (983090)

where subtree1103925() is the requency o subtree1103925 in the parsetree Te kernel unction between two parse trees 1 and2 is de1047297ned as tree(1 2) = 12 and is determined as

tree 9830801 2983081 = 12= sum1103925 subtree1103925 9830801983081 sdot subtree1103925 9830802983081= sum

1103925

1048616 sum9073171isin1

subtree1038389983080110392519830811048617

sdot 1048616 sum9073172isin2

subtree1038389983080110392529830811048617

= sum9073171isin1

sum9073172isin2

98308011039251 11039252983081

(983091)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 613

983094 Mathematical Problems in Engineering

where 1and 2

are all the nodes in trees 1 and 2 Teindicator unction subtree1038389

(1103925) is 983089 i subtree1103925 is rooted at node1103925 and 983088 otherwise (11039251 11039252) is a unction which is de1047297ned as

98308011039251 11039252983081 = sum1103925

subtree103838998308011039251983081 sdot subtree1038389

98308011039252983081 (983092)

Tis unction can be calculated in polynomial time using theollowing recursive de1047297nition

(i) I the productions at 11039251 and 11039252 are different

98308011039251 11039252983081 = 0 (983093)

(ii) I both 11039251 and 11039252 are preterminals

98308011039251 11039252983081 = 1 (983094)

(iii) Otherwise the unction can be de1047297ned as ollow

98308011039251

11039252

983081 = nc(9073171)

prod1103925 9830801 + 983080ch1103925

98308011039251

983081 ch1103925

98308011039252

983081983081983081 (983095)

where nc(11039251) is the number o children o node 11039251 inthe tree

Since the productions at 11039251 and 11039252 are the same nc(11039251)is also equal to nc(11039252) Here ch1103925(11039251) denotes the th childnode o 11039251 Tis recursive algorithm is based on the act thatall subtrees rooted at a certain node can be constructed by combining the subtrees rooted at each o its children

983092983091 Modi1047297ed Parse ree Kernel Te parse tree kernel hasshown good perormance or parse trees o natural language

but it does not work well or program source code compari-son due to two issues Te 1047297rst issue is asymmetric in1047298uenceo node changes Te parse tree rom a source code tends tobe much larger and deeper than that rom a natural languagesentence Tereore the changes near root node have beenre1047298ected more ofen than the changes near lea nodes Tesecond issue is the sequence o subtrees Te original parsetree kernel counts the sequence o subtrees by consideringtheir order However the order o two substructures in asource code is meaningless in programming languages

Son et al proposed a modi1047297ed parse tree kernel to copewith these issues [983093] In order to solve the 1047297rst issue they introduced a decay actor tree and a threshold Δ that controlthe effect o large subtrees Te decay actor scales the relativeimportance o subtrees by their size As the depth o a subtreeincreases the kernel value o the subtree is penalized by (tree)size where size is the depth o the subtree In additionthe limitation o the maximum depth o possible subtrees isset as Δ so that the effect o large subtrees could be reducedTe second issue is solved by changing unction in (983095) toignore the order o two nodes

With a decay actor tree and a threshold Δ the recursiverules o the parse tree kernel is modi1047297ed as ollows

(i) I 11039251 and 11039252 are different

98308011039251

11039252

983081 = 0 (983096)

(ii) I both 11039251 and 11039252 are terminals or the current depth isequal to Δ

98308011039251 11039252983081 = tree (983097)

Equation (983095) cannot be used with these new recursive rules

since the number o child nodes can be different in 11039251 and 11039252Tus we adopt the maximum similarity between child nodesAs a result the unction in (983095) becomes

98308011039251 11039252983081 = tree

nc(9073171)prod1103925

8520081 + maxchisinch11039252

983080ch1103925 98308011039251983081 ch983081852009 (983089983088)

where ch9073172is a set o child nodes o 11039252

Te parse tree kernel with the modi1047297ed unctionmptdoes not satisy Mercerrsquos condition However many unctionsthat do not satisy Mercerrsquos condition [983091983088 983091983089] work wellin computing similarity [983091983090] Finally this parse tree kernelis used as the similarity measure sim

( ) in (983089) or syntactic

structural comparison o source codes

5 Similarity Measure for Source CodesBased on Graph Kernel

983093983089 Source Code as a Graph Recently program sourcecodes are written with object-oriented concepts and severalreactoring techniques so that the codes are getting more andmore modularized at unctional level Since a source codeencodes program logic to solve a problem the execution 1047298ow at unction level is one o the important actors to identiy the source code Tereore this unction-level 1047298ow should beconsidered to compare source codes

One possible representation or the unction-level 1047298ow isa unction-call graph which represents dependencies amongunctions within a program Let be a source code Ten a

unction calls graph = ( ) is a directed graph extractedrom where V isin is a unction in Tus || is the numbero unctions in is a set o edges and each edge represents adependency relation between unctions Tat is an edge 1103925 isin that connects nodesV 1103925 and V implies that a unction V 1103925 callsanother unction V Te weight o an edge is given by

1103925 = 9831631 i V 1103925 calls V 0 otherwise (983089983089)

Figure 983091 illustrates an example unction-call graphextracted rom the Java code in Box 983089 Tis code contains 1047297veunctions including main First main calls rFibonacciand iFibonacci in order Since rFibonacci is arecursive unction it calls itsel iFibonacci calls twounctions initOne and sum to initialize a variable and geta sum Finally main calls println to print out the results

A rule-based approach is adopted to extract a call graphrom a source code Simple rules are used to 1047297nd thecaller-callee relationship rom a parse tree For instance inJava a rule ldquoif lsquoexpression (expresionList)rsquo is oundthen expression is a called unction namerdquo is used to1047297nd subtrees rom a parse tree Ten unction names and

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 713

Mathematical Problems in Engineering 983095

rFibonacci

iFibonacci

main

initOne

sum

println

F983145983143983157983154983141 983091 Example o extracted call graph rom Java code

parameters are extracted rom the matched subtrees Tenodes or the extracted unction names are connected to thecaller node

Function-call graph o a program depicts how the pro-gram executes at unction level and how unctions are relatedto one another Since the 1047298ow o a program is quite uniqueaccording to the task the similarity between two sources canbe calculated using the 1047298ows o the programs Since this 1047298ow is represented as a unction-call graph the graph kernel is thebest method to compare unction-call graphs It showed goodperormance in several 1047297elds including biology and social

network analysis

983093983090 Graph Kernel Graph kernel is a kernel thatis designed tocompare graph structures Like the parse tree kernel a graphis mapped onto a eature space spanned by their subgraphs inthe graph kernel Te intuitive eature o the graph kernel isgraph isomorphism that determines the topological identityAccording to Gartner et al [983089983088] however it is as hard asdeciding whether two graphs are isomorphic to compute any complete graph kernel with an injective mapping unctionor all graphs where graph isomorphism is a NP-completeproblem [983091983091] Tus most graph kernels ocus on alternativeeature representation o graphs

Te random walk graph kernel is one o the most widely used graph kernels It uses all possible random walks aseatures or graphs Let be a set o all possible random walksandW907317() denotes the set o all possible walks with 1103925 edgesin graph For each random walk isin whose length is 1103925the corresponding eature mapping unction o a graph isgiven as

Φ () = radic907317graph1048699 isinW907317 () forall 9830801103925983081 = 9830801103925983081983165 (983089983090)

where 907317graph is a weight or the length 1103925 and (1103925) and (1103925)are the

th label o the random walk

and

respectively

Te kernel unction between two graphs 1 and 2 denotedby graph(1 2) can be de1047297ned as

graph 9830801 2983081 = sum

Φ 9830801983081 sdotΦ 9830802983081 (983089983091)

Gartner et al proposed an approach to calculate all ran-domwalks within two graphswithout explicit enumeration o all random walks [983089983088] A direct product graph o two graphs1 = (1 1) and 2 = (2 2) denoted by 1 times 2 =(times times) where times is its node set and times is its edge set isde1047297ned as ollows

times 9830801 times 2983081 = 1048699983080V 1 1983081 isin 1 times 2 | 983080V 1983081 = 9830801983081983165 times 9830801 times 2983081 = 1048699983080983080V 1 1983081 983080V 2 2983081983081 isin times 9830801 times 2983081 |

983080V 1 V 2983081 isin 1 9830801 2983081 isin 2

983080V 1

V 2

983081 = 9830801

2

983081983165

(983089983092)

where (V ) is the label o a node V and ( ) is the label o an edge between node and node Based on the directproduct graph the random walk kernel can be calculated

Let 1038389times isin R|times|times|times| denote an adjacency matrix o the directproduct1 times2 With a weighting actorgraph ge 0graph in(983089983091) can be rewritten as

graph 9830801 2983081 = |times|sum1103925=1

1048667infinsum907317=0

907317graph1038389907317times1048669

1103925

(983089983093)

Tis random walk kernel can be computed in (11039253

) usingSylvester equation or Conjugate Gradient method where 1103925 isthe number o nodes [983091983092]

983093983091 Modi1047297ed Graph Kernel When a graph kernel is used tocompare source codes goodperormance is not expected dueto theact that the graph kernel measures similaritiesbetweenwalks with an identical label Since the labels (unctionnames) o nodes within the unction-call graph are decidedby human developers they are seldom identical even i thesource codes are simple Tereore the graph kernel has toconsider nonidentical labels

Borgwardt et al modi1047297ed the random walk kernel to

compare nonidentical labels by changing the direct productgraph to include all pairs o nodes and edges [983091983093] Assumethat nodes are compared by a node kernelnode and edges arecompared by an edge kernel kedge edge Tat is node(V )calculates the similarity between two labels rom the nodes V

and andedge((V 1103925 V 1103925+1)(1103925 1103925+1)) computes the similarity between two edges (V 1103925 V 1103925+1) and (1103925 1103925+1) With these twokernels the random walk kernel between two unction callgraphs 1 and 2 is now de1047297ned as

mg 9830801 2983081 = 907317minus1sum1103925=1

step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081 (983089983094)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 813

983096 Mathematical Problems in Engineering

where

step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081= node 983080V 1103925 1103925983081 sdot node 983080V 1103925+1 1103925+1983081

sdot edge

983080983080V 1103925

V 1103925+1

983081 9830801103925

1103925+1

983081983081

(983089983095)

I this modi1047297ed random walk kernel is used or thecomparison o source code the node kernel node and theedge kernel edge should be de1047297ned Note that the labelso edges in unction-call graph are binary values by (983089983089)Tus edge is simply designed to compare binary valuesTe simplest orm or node(V ) is a unction that returns983089 when V and have similar string patterns 983088 otherwiseTat is it returns 983089 i a distance between V and is smallerthan a prede1047297ned threshold In this paper we simply useLevenshtein distance as the distance and set the threshold as983088983093

Te modi1047297ed random walk kernel mg can be alsocomputed using (983089983093) However the adjacency matrix

1038389times o

the direct product 1 times 2 should be modi1047297ed as

9831311038389times983133(V 10383891038389)(V 907317907317)

= 852091step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081

i 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081 isin times0 otherwise(983089983096)

wheretimes isashortormo times(1times2) and the edges (V 1103925 V 1103925+1)and (1103925 1103925+1) belong to 1 and 2 respectively As in parsetree kernel this modi1047297ed graph kernel is used to comparesource codes as similarity measure sim( ) in (983089)

6 Similarity Measure for Source CodesBased on a Composite Kernel

Te modi1047297ed parse tree kernel manages syntactic structuralinormation whereas the modi1047297ed graph kernel considershigh-level topological inormation o source codes In orderto make use o both kinds o inormation the compositiono the two kernels is required Cristianini and Shawe-aylorproved that a new kernel can be obtained by combiningexisting several kernels with some closure properties suchas weighted sum and multiplication [983091983094] Among variousclosure properties this paper adopts the weighted sum sinceit is simple and widely used

Beore combining two kernels the kernels should benormalized since the modi1047297ed parse tree kernel mpt andthe modi1047297ed graph kernelmg are not bound Tereore onekernel can dominate other in their composition In order toremove this effect the kernels are 1047297rst normalized When akernel() is given itsnormalized one()isde1047297nedas

= radic ( ) sdot 983080 983081 (983089983097)

Tereore () is bounded between 983088 and 983089 Tat is 0 le

(

) le 1

Our composite kernelco is composed o the normalizedmodi1047297ed parse tree kernel

mpt and the normalized modi1047297ed

graph kernelmg Tatis the composite kernel

co or given

two source codes and is de1047297ned as

co

= 9830801 minus 983081 sdot mpt

983080

983081 + sdot mg

983080

983081 (983090983088)

where is a mixing weight between two kernels and areparse trees extracted rom source codes and respectivelyand and are call graphs rom and respectively Telarger is the more signi1047297cant mpt is On the other handas the value o gets small the graph kernel mg is moresigni1047297cant than the parse tree kernel mpt Tis compositekernel is used as our 1047297nal similarity measure sim( ) in (983089)

Te parse tree kernel compares source codes with local-level view since it is based on subtree comparison Most pla-giarized source codes change a small portion o the originalsource code Tus the parse tree kernel has shown goodperormances in general However it does not re1047298ect the 1047298ow o the program which is dynamic structural inormationTegraph kernel on the other hand calculates the similarity interms o dynamic high-level view Tus when source codesconsist o a number o unctions the graph kernel achievesreasonable perormance As a result should be determinedby the complexity o source codes since it is a parameter tocontrol the relative importance between the parse tree kerneland the graph kernel

Tere are many methodsthat measure the complexity o asource code One widely-used method is the cyclomatic com-plexity proposed by McCabe [983091983095] Te cyclomatic complexity is a graph-theoretic quantitative metric and measures thenumber o paths within a source code It is simply calculated

using a control 1047298ow graph o a source code where the nodeso the graph correspond to entities o the source code andan (directed) edge between two nodes implies a dependency relation between entities Given the control 1047298ow graph o asource code () the cyclomatic complexity o source code is de1047297ned as

() = minus + 2 (983090983089)

where is the number o edges o the graph is the numbero nodes and is the number o connected components Telarger is the more complicated source code is

In this paper we measure the complexity o a sourcecode using itsunction-call graph Since a unction-call graphrepresents dependencies among unctions within a programit can be considered as a kind o control 1047298ow graphs whereentities o the source code are the unction in the source codeand the edge implies dependencies between unctions

Let() and() be the cyclomatic complexities o twosource codes and respectively Since is the weight o two normalized kernels it has to normalize between 983088 and 983089Te sigmoid unction is de1047297ned or all real input values andreturns a positive value between 983088 and 983089 Tus the sigmoidunction is adopted or o (983090983088) and is de1047297ned as

= 1

1 + minus(min(()())minus25)

(983090983090)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 913

Mathematical Problems in Engineering 983097

983137983138983148983141 983089 Simple statistics on the real data set

Inormation Value

Number o total assignments 983091983094

Number o submitted source codes 983093983093983093

Average number o submitted codes per assignment 983089983093983092983090

Minimum number o lines in source code 983092983097Maximum number o lines in source code 983090983096983094983091

Average number o lines per source code 983091983088983093983088983095

Minimum number o nodes in source code 983089983090

Maximum number o nodes in source code 983092983092983095

Average number o nodes in source code 983094983092983090983097

Number o marked plagiarism pairs 983089983095983093

where min() returns the minimum value between and According to (983090983090) as the cyclomatic complexity gets larger also increases is to be 983088983093 when thecyclomatic complexity o source code is 983090983093 Tis indicates

that when the cyclomatic complexity o source code is983090983093 the parse tree kernel and the graph kernel have anequal importance in the composite kernel A number o source code analysis applications regard source codes whosecyclomatic complexity is more than 983090983093 as complicated codes(httpmsdnmicrosofcomen-uslibraryms983089983096983090983090983089983090aspx )Tus we set 983090983093 as the equal point o the importance betweenthe parse tree kernel and the graph kernel

7 Experiments

983095983089 Experimental Settings For experiments the same dataset in the work o Son et al [983093] is used Tis data setis collected rom actual programming assignments o Javaclasses submitted by undergraduate students rom 983090983088983088983093 to983090983088983088983097 able 983089 shows simple statistics o the data set Te totalnumber o programming assignments is 983091983094 and the numbero submitted source codes is 983093983093983093 or the 983091983094 assignments Tusthe average number o source codes per an assignment is983089983093983092983090

Figure 983092 shows the histogram o the source codes perlines Te -axis is the number o program lines and the -axis represents the number o source codes As shown in this1047297gure about 983095983093 o source codes are written with less than983092983088983088 lines Te minimum number o lines o a source code is983092983097 and the maximum is 983090983096983094983091 Te average number o lines

per code is 983091983088983093983088983095In our data set the minimum number o unctions within

a program is 983089983090 whereas the maximum number is 983092983092983095Te programs with larger number o programs are paintprograms with any buttons In the paint programs studentsare required to set a layout manually with raw unctionssuch assetBounds Tus paint programs have a number o unctions Te average number o unctions is 983094983092983090983095

wo annotators created the gold standard or this dataset Tey investigated all sourcecodes and markedplagiarizedpairs manually In order to measurethe reliability andvalidity o the annotators Cohenrsquos kappa agreement [983091983096] is measuredTe kappa agreement o the annotators is

= 093 whichalls

180

160

140

120

100

80

60

40

20

00 500 1000 1500 2000 2500 3000

Source code lines

N u m

b e r o f s

o u r c e c o

d e s

F983145983143983157983154983141 983092 Histogram o source lines per code

on the category o ldquoalmost perect agreementrdquo Only the pairs judged as plagiarized pairs by both annotators are regardedas actual plagiarized pairs In total 983089983095983093 pairs are marked asplagiarized pairs

Tree metrics are used as evaluation measure precisionrecall and 1-measure Tey are calculated as ollows

Precision = o correctly detected plagiarized pairs

o detected plagiarized pairs

Recall = o correctly detected plagiarized pairs

o true plagiarized pairs

1-measure = 2 sdot precision

sdot recall

precision + recall (983090983091)

In order to evaluate the proposed method severalbaseline systems are used It is compared with JPlag andCCFinder In all experiments or the parse tree kernel thethreshold or subtree depth Δ is set as 983091 and the decay actortree is 983088983089 Te decay actor graph or graph kernel is set tobe 983088983089 empirically in (983090983089) is set to be 983089 because each sourcecode in our data set is a single program

983095983090 Experimental Results Beore evaluating the perormanceo plagiarism detection we 1047297rst examine relatedness between

the number o source code lines and the cyclomatic complex-ity Tis examination tries to show that (983090983090) is easible Since is determined with cyclomatic complexity it is expected to beproportionalto cyclomatic complexity Figure 983093 shows scatterplot between the number o lines and cyclomatic complexityAs shown in this 1047297gure they are highly correlated with eachother in our data set Te Pearson correlation coefficient is983088983095983089983092 Tis result implies that it is easible to set in (983090983090) to beproportional to cyclomatic complexity

In order to see the effect o threshold in (983089) in ourmethod the perormances are measured according to the

values o Figure 983094 shows the perormance o the pro-posed method or various

As

increases the precision

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1013

983089983088 Mathematical Problems in Engineering

600

500

400

300

200

100

0

C y c

l o m a t i c

c o m p l e x i t y

0 500 1000 1500 2000 2500 3000

Number of lines

F983145983143983157983154983141 983093 Scatter plot according to the number o lines o sourcecodes and the corresponding cyclomatic compelxity

10

09

08

07

06

05

086 088 092090 094 096 098

RecallPrecision

F -measure

Te value of threshold

F983145983143983157983154983141 983094 Perormance o the proposed system or real-world dataset

also increases while the recall decreases slightly Te bestperormance is achieved at = 096 with 983088983096983095 o 1-measureTus

= 096 is used at all the experiments below

Figure 983095 compares the proposed method with variouskernels according to the number o source code lines In this1047297gure the -axis is the number lines o source code andthe -axis represent the average 1-measure As shown inthis 1047297gure the original graph kernel shows the worst per-ormance Since it uses only the graph structure o unctioncalls it ofen ails in calculating the similarity among sourcecodes For example assume that there are two source codesIn one source code main calls a unction add and addcalls another unction multiply In the other source code main calls multiply and multiply calls add Tese twosource codes are the same under the graph kernel since thelabel inormation is ignored by the graph kernel Without

Graph kernel

Parse tree kernel

Modi1047297ed graph kernel

Proposed method

10

08

06

04

02

00

A v e r a g e F 1

- m e a s u r e

lt100 lt200 lt300 lt400 gt400

Number of lines

F983145983143983157983154983141 983095 Average 1-measure according to the number o source

code lines

the labels these two graphs are identical On the other handthe modi1047297ed graph kernel utilizes the label inormation As aresult it achieves better perormance than the graph kernel

Te parse tree kernel achieves higher perormance thanother methods or the source codes with less than 983091983088983088 linesWhen the number o lines in source codes is small theplagiarized codes are ofen made by changing the originalone locally Tus the parse tree kernel detects plagiarizedpairs accurately or the codes with small number o linesWhen source codes have more than 983091983088983088 lines the modi1047297edgraph kernel shows slightly better perormance than theparse tree kernel Tis result implies that high-level structuralinormation is another actor to compare (large) source codesand the modi1047297ed graph kernel can re1047298ect this structuralinormation well

Te proposed method that combines the parse tree kerneland the modi1047297ed graph kernel achieves the best perormanceor all source codes except those with 300 sim 400 linesSince the cyclomatic complexity o source codes with 300 sim400 lines is near 983090983093 the proposed method re1047298ects the parsetree kernel and the modi1047297ed graph kernel equally Tusit achieves an average perormance o the kernels By thecyclomatic complexity o source codes the proposed method

is more in1047298uenced by the parse tree kernel when a sourcecode is small I a source code is large the effect o graphkernel is larger than that o the parse tree kernel From theresults it can be concluded that the proposed method doesnot consider only local-level structural inormation but alsohigh-level structural inormation effectively

Te 1047297nal 1-measure o program source code plagiarismdetection is given in able 983090 Te proposed method showsthe best 1-measure compared to other kernels or opensource plagiarism systems Te difference o 1-measure is983088983090983097 against JPlag 983088983089983095 against CCFinder 983088983088983096 against themodi1047297ed graph and 983088983088983093 against the modi1047297ed parse treekernel Tis result implies that or source code plagiarism

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1113

Mathematical Problems in Engineering 983089983089

983137983138983148983141 983090 Final 1-measure o plagiarism detection

Method 1-measure

JPlag 983088983093983094

CCFinder 983088983095983088

Modi1047297ed parse tree kernel ( = 095) [983093] 983088983096983092

Graph kernel ( = 099) 983088983092983093Modi1047297ed graph kernel ( = 097) 983088983096983090

Proposed method ( = 096) 983088983096983095

detection the similarity measure sim in (983089) should considernot only the syntactic structural inormation but also thedynamic call structure simultaneously

8 Conclusion

In this paper we have proposed a novel method or programsource code comparison Te proposed method calculatesthe

similarity between two source codes with the composition o two kinds o structural inormation extractedrom the sourcecodes Tat is the method uses both syntactic inormationand dynamic inormation Te syntactic inormation whichprovides local-level structural view is included in the parsetree In order to compare the parse trees this paper adoptsa specialized tree kernel or parse trees o source codes Tedynamic inormation which is contained in the unction-callgraph gives high and global level structural view Te graphkernel with the consideration unction names is adopted tore1047298ect the graph structure Finally the proposed methoduses a composite kernel o the kernels to use both kindso inormation In addition the weights o the kernels in

the composite kernel are automatically determined with thecyclomatic complexity

In the experiments o Java program source code plagia-rism detection with real data set it is shown that the proposedmethod outperormed existing methods in detecting plagia-rized pairs In particular the experiments with the variousnumber o lines show that the proposed method alwaysworkswell regardless o the size o source codes

One advantage o the proposed method is that it canbe used with other languages such as C C++ and Pythoneven i the experiments were only conducted with JavaSince the proposed method requires only parse trees andunction-call graphs o source codes it can be applied to any

other languages i a parser or the languages is available Allkinds o inormation o the proposed method are available athttpmlknuackrplagiarism

Conflict of Interests

Te authors declare that there is no con1047298ict o interestsregarding the publication o this paper

Acknowledgments

Tis study was supported by the BK983090983089 Plus project (SWHuman Resource Development Program or Supporting

Smart Lie) unded by the Ministry o Education Schoolo Computer Science and Engineering Kyungpook NationalUniversity Korea (983090983089A983090983088983089983091983089983094983088983088983088983088983093) and by IC RampD pro-gram o MSIPIIP (983089983088983088983092983092983092983097983092 WiseKB Big data based sel-evolving knowledge base and reasoning platorm)

References

[983089] J-W Son and S-B Park ldquoWeb table discrimination with com-position o rich structural and content inormationrdquo Applied Sof Computing vol 983089983091 no 983089 pp 983092983095ndash983093983095 983090983088983089983091

[983090] DL McCabe ldquoCheating among college and university studentsa north American perspectiverdquo International Journal or Educa-tional Integrity vol 983089 no 983089 pp 983089ndash983089983089 983090983088983088983093

[983091] S Horwitz ldquoIdentiying the semantic and textual differencesbetween two versions o a programrdquo in Proceedings o the ACM SIGPLAN Conerence on Programming Language Design and Implementation pp 983090983091983092ndash983090983092983093 983089983097983097983088

[983092] W Yang ldquoIdentiying syntactic differences between two pro-

gramsrdquo Sofware Practice and Experience vol 983090983089 no 983095 pp 983095983091983097ndash983095983093983093 983089983097983097983089

[983093] J-W Son -G Noh H-J Song and S-B Park ldquoAn applicationor plagiarized source code detection based on a parse treekernelrdquo Engineering Applications o Arti1047297cial Intelligencevol983090983094no 983096 pp 983089983097983089983089ndash983089983097983089983096 983090983088983089983091

[983094] M L Kammer Plagiarism detection in haskell programs using call graph matching [MS thesis] Utrecht University 983090983088983089983089

[983095] D Haussler ldquoConvolution kernels on discrete structuresrdquo echRep UCS-CRL-983097983097-983089983088 University o Caliornia Santa CruzCali USA 983089983097983097983097

[983096] B Scholkop K suda and J-P Vert Kernel Methods inComputational Biology MI Press 983090983088983088983092

[983097] M Collins and N Duffy ldquoConvolution kernels or natural lan-guagerdquo in Advances in Neural Inormation Processing Systemspp 983094983090983093ndash983094983091983090 983090983088983088983089

[983089983088] Gartner P Flach and S Wrobel ldquoOn graph kernels hardnessresults and efficient alternativesrdquo in Proceedings o the 983089983094th Annual Conerence on Learning Teory pp 983089983090983097ndash983089983092983091 August983090983088983088983091

[983089983089] D Hindle ldquoNoun classi1047297cation rom predicate-argument struc-turesrdquo in Proceedings o the 983090983096th Annual Meeting on Association or Computational Linguistics (ACL rsquo983097983088) pp 983090983094983096ndash983090983095983093 Strouds-burg Pa USA June 983089983097983097983088

[983089983090] P Resnik ldquoUsing inormation content to evaluate semanticsimilarity in a taxonomyrdquo in Proceedings o the983089983091th International Joint Conerence on Arti1047297cial Intelligence pp 983092983092983096ndash983092983093983091 983089983097983097983093

[983089983091] B Gipp N Meuschke and C Breitinger ldquoCitation-based pla-giarism detection practicability on a large-scale scienti1047297c cor-pusrdquo Journalo the Association or Inormation Science and ech-nology vol 983094983093 no 983096 pp 983089983093983090983095ndash983089983093983092983088 983090983088983089983092

[983089983092] G Varelas E Voutsakis P Rafopoulou E G Petrakis andE E Milios ldquoSemantic similarity methods in wordnet andtheir application to inormation retrieval on the webrdquo in Pro-ceedings o the 983095th Annual ACM International Workshop on WebInormation and Data Management pp 983089983088ndash983089983094 983090983088983088983093

[983089983093] K Williams H-H Chen andC L GilesldquoClassiying and rank-ing search engine results as potential sources o plagiarismrdquo inProceedings o the ACM Symposium on Document Engineering pp 983097983095ndash983089983088983094 Fort Collins Colo USA September 983090983088983089983092

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1213

983089983090 Mathematical Problems in Engineering

[983089983094] R A Jarvis and E A Patrick ldquoClustering using a similarity measure based on shared near neighborsrdquo IEEE ransactions onComputers vol 983090983090 no 983089983089 pp 983089983088983090983093ndash983089983088983091983092 983089983097983095983091

[983089983095] K J Ottenstein ldquoAn algorithmic approach to the detection andprevention o plagiarismrdquo ACM SIGCSE Bulletin vol 983096 no 983092pp 983091983088ndash983092983089 983089983097983095983094

[983089983096] M Halstead Elements o Sofware Science Elsevier 983089983097983095983095[983089983097] I D Baxter A Yahin L Moura M SantrsquoAnna and L Bier

ldquoClone detection using abstract syntax treesrdquo in Proceedingso the IEEE International Conerence on Sofware Maintenance(ICSM rsquo983097983096) pp 983091983094983096ndash983091983095983095 November 983089983097983097983096

[983090983088] J Ferrante K J Ottenstein and J D Warren ldquoTe programdependence graph and its use in optimizationrdquo ACM ransac-tions on Programming Languages and Systems vol 983097 no 983091 pp983091983089983097ndash983091983092983097 983089983097983096983095

[983090983089] C Liu C Chen J Han and P S Yu ldquoGplag detection o sofware plagiarism by program dependence graph analysisrdquo inProceedings o the 983089983090th ACM SIGKDD International Conerenceon Knowledge Discovery and Data Mining pp 983096983095983090ndash983096983096983089 983090983088983088983094

[983090983090] H-I Lim H Park S Choi and Han ldquoA method or detecting

the thef o Java programs through analysis o the control 1047298ow inormationrdquo Inormation and Sofware echnology vol 983093983089 no983097 pp 983089983091983091983096ndash983089983091983093983088 983090983088983088983097

[983090983091] D-K Chae J Ha S-W Kim B J Kang and E G ImldquoSofware plagiarism detection a graph-based approachrdquo inProceedings o the 983090983090nd ACM International Conerence onInormation amp Knowledge Management (CIKM rsquo983089983091) pp 983089983093983095983095ndash983089983093983096983088 Burlingame Cali USA November 983090983088983089983091

[983090983092] E Stamatatos ldquoPlagiarism detection using stopword n-gramsrdquo Journal o the American Society or Inormation Science and echnology vol 983094983090 no 983089983090 pp 983090983093983089983090ndash983090983093983090983095 983090983088983089983089

[983090983093] G Cosma and M Joy ldquoAn approach to source-code plagiarismdetection and investigation using latent semantic analysisrdquo

IEEE ransactions on Computers vol 983094983089 no 983091 pp 983091983095983097ndash983091983097983092983090983088983089983090

[983090983094] L Prechelt G Malpohl and M Philippsen ldquoFinding plagia-risms among a set o programs with jplagrdquo Journal o Universal Computer Science vol 983096 no 983089983089 pp 983089983088983089983094ndash983089983088983091983096 983090983088983088983090

[983090983095] A Aiken ldquoMoss a system or detecting sofware plagiarismrdquo983089983097983097983096 httptheorystanordedusimaikenmoss

[983090983096] Kamiya S Kusumoto and K Inoue ldquoCCFinder a multilin-guistic token-based code clone detection system or large scalesource coderdquo IEEE ransactions on Sofware Engineering vol983090983096 no 983095 pp 983094983093983092ndash983094983095983088 983090983088983088983090

[983090983097] J Parr and R W Quong ldquoANLR a predicated-LL(k) parsergeneratorrdquo Sofware Practice and Experience vol 983090983093 no 983095 pp983095983096983097ndash983096983089983088 983089983097983097983093

[983091983088] V N Vapnik Te Nature o StatisticalLearningTeory SpringerNew York NY USA 983089983097983097983093

[983091983089] R Courant and D Hilbert Methods o Mathematical PhysicsInterscience New York NY USA 983089983097983093983091

[983091983090] A Moschitti and F M Zanzotto ldquoFast and effective kernelsor relational learning rom textsrdquo in Proceedings o the 983090983092thInternational Conerence on Machine Learning (ICML rsquo983088983095) pp983094983092983097ndash983094983093983094 Corvallis Ore USA June 983090983088983088983095

[983091983091] M R Garey and D S Johnson Computers and Intractability AGuide to the Teory o NP-Completeness W H Freeman 983089983097983097983088

[983091983092] S V N Vishwanathan N N Schraudolph R Kondor and KM Borgwardt ldquoGraph kernelsrdquo Journal o Machine Learning Research vol 983089983089 pp 983089983090983088983089ndash983089983090983092983090 983090983088983089983088

[983091983093] K M Borgwardt C S Ong S Schonauer S V N Vish-wanathan A J Smola and H-P Kriegel ldquoProtein unction pre-diction via graph kernelsrdquo Bioinormatics vol 983090983089 supplement 983089pp i983092983095ndashi983093983094 983090983088983088983093

[983091983094] N Cristianini and J Shawe-aylor An Introduction to Support Vector Machines Cambridge University Press CambridgeUK983090983088983088983088

[983091983095] J McCabe ldquoA complexity measurerdquo IEEE ransactions onSofware Engineering vol 983090 no 983092 pp 983091983088983096ndash983091983090983088 983089983097983095983094

[983091983096] J Carletta ldquoAssessing agreement on classi1047297cation tasks thekappa statisticrdquo Computational Linguistics vol 983090983090 no 983090 pp983090983092983097ndash983090983093983092 983089983097983097983094

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1313

Submit your manuscripts at

httpwwwhindawicom

Page 5: journal for scribd

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 513

Mathematical Problems in Engineering 983093

public class Fibo public static int rFibonacci(int number)

if (number == 983089 number == 983090) return 983089

return rFibonacci(number-983089) + rFibonacci(number-983090)

private static int sum(int value983089 int value983090) return value983089 + value983090

public static int iFibonacci(int number) if (number == 983089 number == 983090)

return 983089int 1047297bo983089 = 983089 1047297bo983090 = 983089int 1047297bonacci = initOne()for (int i = 983091 i lt= number i++)

1047297bonacci = sum(1047297bo983089 1047297bo983090)1047297bo983089 = 1047297bo9830901047297bo983090 = 1047297bonacci

return 1047297bonacciprivate static int initOne()

return 983089public static void main(String[] args)

int rFibo = FiborFibonacci(983095)int iFibo = FiboiFibonacci(983095)Systemout println(rFibo)Systemout println(iFibo)

B983151983160 983089 An example o Java source code

Since parse tree has syntactic structural inormation ametric or parse tree that re1047298ects entire structural inorma-tion is required Te parse tree kernel is one o such metricsIt compares parse trees without manually designed structuraleatures

983092983090 Parse ree Kernel Parse tree kernel is a kernel that is

designed to compare tree structures such as parse trees o natural language sentences Tis kernel maps a parse treeonto a space spanned by all subtrees that can appear possibly in the parse tree Te explicit enumeration o all subtreesis computationally ineasible since the number o subtreesincreases exponentially as the size o tree grows Collins andDuffy proposed a method to compute the inner product o two trees without having to enumerate all subtrees [983097]

Let subtree1 subtree2 be all o the subtrees in a parsetree Ten can be represented as a vector

= ⟨subtree1 () subtree2 () subtree1038389 ()⟩ (983090)

where subtree1103925() is the requency o subtree1103925 in the parsetree Te kernel unction between two parse trees 1 and2 is de1047297ned as tree(1 2) = 12 and is determined as

tree 9830801 2983081 = 12= sum1103925 subtree1103925 9830801983081 sdot subtree1103925 9830802983081= sum

1103925

1048616 sum9073171isin1

subtree1038389983080110392519830811048617

sdot 1048616 sum9073172isin2

subtree1038389983080110392529830811048617

= sum9073171isin1

sum9073172isin2

98308011039251 11039252983081

(983091)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 613

983094 Mathematical Problems in Engineering

where 1and 2

are all the nodes in trees 1 and 2 Teindicator unction subtree1038389

(1103925) is 983089 i subtree1103925 is rooted at node1103925 and 983088 otherwise (11039251 11039252) is a unction which is de1047297ned as

98308011039251 11039252983081 = sum1103925

subtree103838998308011039251983081 sdot subtree1038389

98308011039252983081 (983092)

Tis unction can be calculated in polynomial time using theollowing recursive de1047297nition

(i) I the productions at 11039251 and 11039252 are different

98308011039251 11039252983081 = 0 (983093)

(ii) I both 11039251 and 11039252 are preterminals

98308011039251 11039252983081 = 1 (983094)

(iii) Otherwise the unction can be de1047297ned as ollow

98308011039251

11039252

983081 = nc(9073171)

prod1103925 9830801 + 983080ch1103925

98308011039251

983081 ch1103925

98308011039252

983081983081983081 (983095)

where nc(11039251) is the number o children o node 11039251 inthe tree

Since the productions at 11039251 and 11039252 are the same nc(11039251)is also equal to nc(11039252) Here ch1103925(11039251) denotes the th childnode o 11039251 Tis recursive algorithm is based on the act thatall subtrees rooted at a certain node can be constructed by combining the subtrees rooted at each o its children

983092983091 Modi1047297ed Parse ree Kernel Te parse tree kernel hasshown good perormance or parse trees o natural language

but it does not work well or program source code compari-son due to two issues Te 1047297rst issue is asymmetric in1047298uenceo node changes Te parse tree rom a source code tends tobe much larger and deeper than that rom a natural languagesentence Tereore the changes near root node have beenre1047298ected more ofen than the changes near lea nodes Tesecond issue is the sequence o subtrees Te original parsetree kernel counts the sequence o subtrees by consideringtheir order However the order o two substructures in asource code is meaningless in programming languages

Son et al proposed a modi1047297ed parse tree kernel to copewith these issues [983093] In order to solve the 1047297rst issue they introduced a decay actor tree and a threshold Δ that controlthe effect o large subtrees Te decay actor scales the relativeimportance o subtrees by their size As the depth o a subtreeincreases the kernel value o the subtree is penalized by (tree)size where size is the depth o the subtree In additionthe limitation o the maximum depth o possible subtrees isset as Δ so that the effect o large subtrees could be reducedTe second issue is solved by changing unction in (983095) toignore the order o two nodes

With a decay actor tree and a threshold Δ the recursiverules o the parse tree kernel is modi1047297ed as ollows

(i) I 11039251 and 11039252 are different

98308011039251

11039252

983081 = 0 (983096)

(ii) I both 11039251 and 11039252 are terminals or the current depth isequal to Δ

98308011039251 11039252983081 = tree (983097)

Equation (983095) cannot be used with these new recursive rules

since the number o child nodes can be different in 11039251 and 11039252Tus we adopt the maximum similarity between child nodesAs a result the unction in (983095) becomes

98308011039251 11039252983081 = tree

nc(9073171)prod1103925

8520081 + maxchisinch11039252

983080ch1103925 98308011039251983081 ch983081852009 (983089983088)

where ch9073172is a set o child nodes o 11039252

Te parse tree kernel with the modi1047297ed unctionmptdoes not satisy Mercerrsquos condition However many unctionsthat do not satisy Mercerrsquos condition [983091983088 983091983089] work wellin computing similarity [983091983090] Finally this parse tree kernelis used as the similarity measure sim

( ) in (983089) or syntactic

structural comparison o source codes

5 Similarity Measure for Source CodesBased on Graph Kernel

983093983089 Source Code as a Graph Recently program sourcecodes are written with object-oriented concepts and severalreactoring techniques so that the codes are getting more andmore modularized at unctional level Since a source codeencodes program logic to solve a problem the execution 1047298ow at unction level is one o the important actors to identiy the source code Tereore this unction-level 1047298ow should beconsidered to compare source codes

One possible representation or the unction-level 1047298ow isa unction-call graph which represents dependencies amongunctions within a program Let be a source code Ten a

unction calls graph = ( ) is a directed graph extractedrom where V isin is a unction in Tus || is the numbero unctions in is a set o edges and each edge represents adependency relation between unctions Tat is an edge 1103925 isin that connects nodesV 1103925 and V implies that a unction V 1103925 callsanother unction V Te weight o an edge is given by

1103925 = 9831631 i V 1103925 calls V 0 otherwise (983089983089)

Figure 983091 illustrates an example unction-call graphextracted rom the Java code in Box 983089 Tis code contains 1047297veunctions including main First main calls rFibonacciand iFibonacci in order Since rFibonacci is arecursive unction it calls itsel iFibonacci calls twounctions initOne and sum to initialize a variable and geta sum Finally main calls println to print out the results

A rule-based approach is adopted to extract a call graphrom a source code Simple rules are used to 1047297nd thecaller-callee relationship rom a parse tree For instance inJava a rule ldquoif lsquoexpression (expresionList)rsquo is oundthen expression is a called unction namerdquo is used to1047297nd subtrees rom a parse tree Ten unction names and

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 713

Mathematical Problems in Engineering 983095

rFibonacci

iFibonacci

main

initOne

sum

println

F983145983143983157983154983141 983091 Example o extracted call graph rom Java code

parameters are extracted rom the matched subtrees Tenodes or the extracted unction names are connected to thecaller node

Function-call graph o a program depicts how the pro-gram executes at unction level and how unctions are relatedto one another Since the 1047298ow o a program is quite uniqueaccording to the task the similarity between two sources canbe calculated using the 1047298ows o the programs Since this 1047298ow is represented as a unction-call graph the graph kernel is thebest method to compare unction-call graphs It showed goodperormance in several 1047297elds including biology and social

network analysis

983093983090 Graph Kernel Graph kernel is a kernel thatis designed tocompare graph structures Like the parse tree kernel a graphis mapped onto a eature space spanned by their subgraphs inthe graph kernel Te intuitive eature o the graph kernel isgraph isomorphism that determines the topological identityAccording to Gartner et al [983089983088] however it is as hard asdeciding whether two graphs are isomorphic to compute any complete graph kernel with an injective mapping unctionor all graphs where graph isomorphism is a NP-completeproblem [983091983091] Tus most graph kernels ocus on alternativeeature representation o graphs

Te random walk graph kernel is one o the most widely used graph kernels It uses all possible random walks aseatures or graphs Let be a set o all possible random walksandW907317() denotes the set o all possible walks with 1103925 edgesin graph For each random walk isin whose length is 1103925the corresponding eature mapping unction o a graph isgiven as

Φ () = radic907317graph1048699 isinW907317 () forall 9830801103925983081 = 9830801103925983081983165 (983089983090)

where 907317graph is a weight or the length 1103925 and (1103925) and (1103925)are the

th label o the random walk

and

respectively

Te kernel unction between two graphs 1 and 2 denotedby graph(1 2) can be de1047297ned as

graph 9830801 2983081 = sum

Φ 9830801983081 sdotΦ 9830802983081 (983089983091)

Gartner et al proposed an approach to calculate all ran-domwalks within two graphswithout explicit enumeration o all random walks [983089983088] A direct product graph o two graphs1 = (1 1) and 2 = (2 2) denoted by 1 times 2 =(times times) where times is its node set and times is its edge set isde1047297ned as ollows

times 9830801 times 2983081 = 1048699983080V 1 1983081 isin 1 times 2 | 983080V 1983081 = 9830801983081983165 times 9830801 times 2983081 = 1048699983080983080V 1 1983081 983080V 2 2983081983081 isin times 9830801 times 2983081 |

983080V 1 V 2983081 isin 1 9830801 2983081 isin 2

983080V 1

V 2

983081 = 9830801

2

983081983165

(983089983092)

where (V ) is the label o a node V and ( ) is the label o an edge between node and node Based on the directproduct graph the random walk kernel can be calculated

Let 1038389times isin R|times|times|times| denote an adjacency matrix o the directproduct1 times2 With a weighting actorgraph ge 0graph in(983089983091) can be rewritten as

graph 9830801 2983081 = |times|sum1103925=1

1048667infinsum907317=0

907317graph1038389907317times1048669

1103925

(983089983093)

Tis random walk kernel can be computed in (11039253

) usingSylvester equation or Conjugate Gradient method where 1103925 isthe number o nodes [983091983092]

983093983091 Modi1047297ed Graph Kernel When a graph kernel is used tocompare source codes goodperormance is not expected dueto theact that the graph kernel measures similaritiesbetweenwalks with an identical label Since the labels (unctionnames) o nodes within the unction-call graph are decidedby human developers they are seldom identical even i thesource codes are simple Tereore the graph kernel has toconsider nonidentical labels

Borgwardt et al modi1047297ed the random walk kernel to

compare nonidentical labels by changing the direct productgraph to include all pairs o nodes and edges [983091983093] Assumethat nodes are compared by a node kernelnode and edges arecompared by an edge kernel kedge edge Tat is node(V )calculates the similarity between two labels rom the nodes V

and andedge((V 1103925 V 1103925+1)(1103925 1103925+1)) computes the similarity between two edges (V 1103925 V 1103925+1) and (1103925 1103925+1) With these twokernels the random walk kernel between two unction callgraphs 1 and 2 is now de1047297ned as

mg 9830801 2983081 = 907317minus1sum1103925=1

step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081 (983089983094)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 813

983096 Mathematical Problems in Engineering

where

step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081= node 983080V 1103925 1103925983081 sdot node 983080V 1103925+1 1103925+1983081

sdot edge

983080983080V 1103925

V 1103925+1

983081 9830801103925

1103925+1

983081983081

(983089983095)

I this modi1047297ed random walk kernel is used or thecomparison o source code the node kernel node and theedge kernel edge should be de1047297ned Note that the labelso edges in unction-call graph are binary values by (983089983089)Tus edge is simply designed to compare binary valuesTe simplest orm or node(V ) is a unction that returns983089 when V and have similar string patterns 983088 otherwiseTat is it returns 983089 i a distance between V and is smallerthan a prede1047297ned threshold In this paper we simply useLevenshtein distance as the distance and set the threshold as983088983093

Te modi1047297ed random walk kernel mg can be alsocomputed using (983089983093) However the adjacency matrix

1038389times o

the direct product 1 times 2 should be modi1047297ed as

9831311038389times983133(V 10383891038389)(V 907317907317)

= 852091step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081

i 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081 isin times0 otherwise(983089983096)

wheretimes isashortormo times(1times2) and the edges (V 1103925 V 1103925+1)and (1103925 1103925+1) belong to 1 and 2 respectively As in parsetree kernel this modi1047297ed graph kernel is used to comparesource codes as similarity measure sim( ) in (983089)

6 Similarity Measure for Source CodesBased on a Composite Kernel

Te modi1047297ed parse tree kernel manages syntactic structuralinormation whereas the modi1047297ed graph kernel considershigh-level topological inormation o source codes In orderto make use o both kinds o inormation the compositiono the two kernels is required Cristianini and Shawe-aylorproved that a new kernel can be obtained by combiningexisting several kernels with some closure properties suchas weighted sum and multiplication [983091983094] Among variousclosure properties this paper adopts the weighted sum sinceit is simple and widely used

Beore combining two kernels the kernels should benormalized since the modi1047297ed parse tree kernel mpt andthe modi1047297ed graph kernelmg are not bound Tereore onekernel can dominate other in their composition In order toremove this effect the kernels are 1047297rst normalized When akernel() is given itsnormalized one()isde1047297nedas

= radic ( ) sdot 983080 983081 (983089983097)

Tereore () is bounded between 983088 and 983089 Tat is 0 le

(

) le 1

Our composite kernelco is composed o the normalizedmodi1047297ed parse tree kernel

mpt and the normalized modi1047297ed

graph kernelmg Tatis the composite kernel

co or given

two source codes and is de1047297ned as

co

= 9830801 minus 983081 sdot mpt

983080

983081 + sdot mg

983080

983081 (983090983088)

where is a mixing weight between two kernels and areparse trees extracted rom source codes and respectivelyand and are call graphs rom and respectively Telarger is the more signi1047297cant mpt is On the other handas the value o gets small the graph kernel mg is moresigni1047297cant than the parse tree kernel mpt Tis compositekernel is used as our 1047297nal similarity measure sim( ) in (983089)

Te parse tree kernel compares source codes with local-level view since it is based on subtree comparison Most pla-giarized source codes change a small portion o the originalsource code Tus the parse tree kernel has shown goodperormances in general However it does not re1047298ect the 1047298ow o the program which is dynamic structural inormationTegraph kernel on the other hand calculates the similarity interms o dynamic high-level view Tus when source codesconsist o a number o unctions the graph kernel achievesreasonable perormance As a result should be determinedby the complexity o source codes since it is a parameter tocontrol the relative importance between the parse tree kerneland the graph kernel

Tere are many methodsthat measure the complexity o asource code One widely-used method is the cyclomatic com-plexity proposed by McCabe [983091983095] Te cyclomatic complexity is a graph-theoretic quantitative metric and measures thenumber o paths within a source code It is simply calculated

using a control 1047298ow graph o a source code where the nodeso the graph correspond to entities o the source code andan (directed) edge between two nodes implies a dependency relation between entities Given the control 1047298ow graph o asource code () the cyclomatic complexity o source code is de1047297ned as

() = minus + 2 (983090983089)

where is the number o edges o the graph is the numbero nodes and is the number o connected components Telarger is the more complicated source code is

In this paper we measure the complexity o a sourcecode using itsunction-call graph Since a unction-call graphrepresents dependencies among unctions within a programit can be considered as a kind o control 1047298ow graphs whereentities o the source code are the unction in the source codeand the edge implies dependencies between unctions

Let() and() be the cyclomatic complexities o twosource codes and respectively Since is the weight o two normalized kernels it has to normalize between 983088 and 983089Te sigmoid unction is de1047297ned or all real input values andreturns a positive value between 983088 and 983089 Tus the sigmoidunction is adopted or o (983090983088) and is de1047297ned as

= 1

1 + minus(min(()())minus25)

(983090983090)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 913

Mathematical Problems in Engineering 983097

983137983138983148983141 983089 Simple statistics on the real data set

Inormation Value

Number o total assignments 983091983094

Number o submitted source codes 983093983093983093

Average number o submitted codes per assignment 983089983093983092983090

Minimum number o lines in source code 983092983097Maximum number o lines in source code 983090983096983094983091

Average number o lines per source code 983091983088983093983088983095

Minimum number o nodes in source code 983089983090

Maximum number o nodes in source code 983092983092983095

Average number o nodes in source code 983094983092983090983097

Number o marked plagiarism pairs 983089983095983093

where min() returns the minimum value between and According to (983090983090) as the cyclomatic complexity gets larger also increases is to be 983088983093 when thecyclomatic complexity o source code is 983090983093 Tis indicates

that when the cyclomatic complexity o source code is983090983093 the parse tree kernel and the graph kernel have anequal importance in the composite kernel A number o source code analysis applications regard source codes whosecyclomatic complexity is more than 983090983093 as complicated codes(httpmsdnmicrosofcomen-uslibraryms983089983096983090983090983089983090aspx )Tus we set 983090983093 as the equal point o the importance betweenthe parse tree kernel and the graph kernel

7 Experiments

983095983089 Experimental Settings For experiments the same dataset in the work o Son et al [983093] is used Tis data setis collected rom actual programming assignments o Javaclasses submitted by undergraduate students rom 983090983088983088983093 to983090983088983088983097 able 983089 shows simple statistics o the data set Te totalnumber o programming assignments is 983091983094 and the numbero submitted source codes is 983093983093983093 or the 983091983094 assignments Tusthe average number o source codes per an assignment is983089983093983092983090

Figure 983092 shows the histogram o the source codes perlines Te -axis is the number o program lines and the -axis represents the number o source codes As shown in this1047297gure about 983095983093 o source codes are written with less than983092983088983088 lines Te minimum number o lines o a source code is983092983097 and the maximum is 983090983096983094983091 Te average number o lines

per code is 983091983088983093983088983095In our data set the minimum number o unctions within

a program is 983089983090 whereas the maximum number is 983092983092983095Te programs with larger number o programs are paintprograms with any buttons In the paint programs studentsare required to set a layout manually with raw unctionssuch assetBounds Tus paint programs have a number o unctions Te average number o unctions is 983094983092983090983095

wo annotators created the gold standard or this dataset Tey investigated all sourcecodes and markedplagiarizedpairs manually In order to measurethe reliability andvalidity o the annotators Cohenrsquos kappa agreement [983091983096] is measuredTe kappa agreement o the annotators is

= 093 whichalls

180

160

140

120

100

80

60

40

20

00 500 1000 1500 2000 2500 3000

Source code lines

N u m

b e r o f s

o u r c e c o

d e s

F983145983143983157983154983141 983092 Histogram o source lines per code

on the category o ldquoalmost perect agreementrdquo Only the pairs judged as plagiarized pairs by both annotators are regardedas actual plagiarized pairs In total 983089983095983093 pairs are marked asplagiarized pairs

Tree metrics are used as evaluation measure precisionrecall and 1-measure Tey are calculated as ollows

Precision = o correctly detected plagiarized pairs

o detected plagiarized pairs

Recall = o correctly detected plagiarized pairs

o true plagiarized pairs

1-measure = 2 sdot precision

sdot recall

precision + recall (983090983091)

In order to evaluate the proposed method severalbaseline systems are used It is compared with JPlag andCCFinder In all experiments or the parse tree kernel thethreshold or subtree depth Δ is set as 983091 and the decay actortree is 983088983089 Te decay actor graph or graph kernel is set tobe 983088983089 empirically in (983090983089) is set to be 983089 because each sourcecode in our data set is a single program

983095983090 Experimental Results Beore evaluating the perormanceo plagiarism detection we 1047297rst examine relatedness between

the number o source code lines and the cyclomatic complex-ity Tis examination tries to show that (983090983090) is easible Since is determined with cyclomatic complexity it is expected to beproportionalto cyclomatic complexity Figure 983093 shows scatterplot between the number o lines and cyclomatic complexityAs shown in this 1047297gure they are highly correlated with eachother in our data set Te Pearson correlation coefficient is983088983095983089983092 Tis result implies that it is easible to set in (983090983090) to beproportional to cyclomatic complexity

In order to see the effect o threshold in (983089) in ourmethod the perormances are measured according to the

values o Figure 983094 shows the perormance o the pro-posed method or various

As

increases the precision

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1013

983089983088 Mathematical Problems in Engineering

600

500

400

300

200

100

0

C y c

l o m a t i c

c o m p l e x i t y

0 500 1000 1500 2000 2500 3000

Number of lines

F983145983143983157983154983141 983093 Scatter plot according to the number o lines o sourcecodes and the corresponding cyclomatic compelxity

10

09

08

07

06

05

086 088 092090 094 096 098

RecallPrecision

F -measure

Te value of threshold

F983145983143983157983154983141 983094 Perormance o the proposed system or real-world dataset

also increases while the recall decreases slightly Te bestperormance is achieved at = 096 with 983088983096983095 o 1-measureTus

= 096 is used at all the experiments below

Figure 983095 compares the proposed method with variouskernels according to the number o source code lines In this1047297gure the -axis is the number lines o source code andthe -axis represent the average 1-measure As shown inthis 1047297gure the original graph kernel shows the worst per-ormance Since it uses only the graph structure o unctioncalls it ofen ails in calculating the similarity among sourcecodes For example assume that there are two source codesIn one source code main calls a unction add and addcalls another unction multiply In the other source code main calls multiply and multiply calls add Tese twosource codes are the same under the graph kernel since thelabel inormation is ignored by the graph kernel Without

Graph kernel

Parse tree kernel

Modi1047297ed graph kernel

Proposed method

10

08

06

04

02

00

A v e r a g e F 1

- m e a s u r e

lt100 lt200 lt300 lt400 gt400

Number of lines

F983145983143983157983154983141 983095 Average 1-measure according to the number o source

code lines

the labels these two graphs are identical On the other handthe modi1047297ed graph kernel utilizes the label inormation As aresult it achieves better perormance than the graph kernel

Te parse tree kernel achieves higher perormance thanother methods or the source codes with less than 983091983088983088 linesWhen the number o lines in source codes is small theplagiarized codes are ofen made by changing the originalone locally Tus the parse tree kernel detects plagiarizedpairs accurately or the codes with small number o linesWhen source codes have more than 983091983088983088 lines the modi1047297edgraph kernel shows slightly better perormance than theparse tree kernel Tis result implies that high-level structuralinormation is another actor to compare (large) source codesand the modi1047297ed graph kernel can re1047298ect this structuralinormation well

Te proposed method that combines the parse tree kerneland the modi1047297ed graph kernel achieves the best perormanceor all source codes except those with 300 sim 400 linesSince the cyclomatic complexity o source codes with 300 sim400 lines is near 983090983093 the proposed method re1047298ects the parsetree kernel and the modi1047297ed graph kernel equally Tusit achieves an average perormance o the kernels By thecyclomatic complexity o source codes the proposed method

is more in1047298uenced by the parse tree kernel when a sourcecode is small I a source code is large the effect o graphkernel is larger than that o the parse tree kernel From theresults it can be concluded that the proposed method doesnot consider only local-level structural inormation but alsohigh-level structural inormation effectively

Te 1047297nal 1-measure o program source code plagiarismdetection is given in able 983090 Te proposed method showsthe best 1-measure compared to other kernels or opensource plagiarism systems Te difference o 1-measure is983088983090983097 against JPlag 983088983089983095 against CCFinder 983088983088983096 against themodi1047297ed graph and 983088983088983093 against the modi1047297ed parse treekernel Tis result implies that or source code plagiarism

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1113

Mathematical Problems in Engineering 983089983089

983137983138983148983141 983090 Final 1-measure o plagiarism detection

Method 1-measure

JPlag 983088983093983094

CCFinder 983088983095983088

Modi1047297ed parse tree kernel ( = 095) [983093] 983088983096983092

Graph kernel ( = 099) 983088983092983093Modi1047297ed graph kernel ( = 097) 983088983096983090

Proposed method ( = 096) 983088983096983095

detection the similarity measure sim in (983089) should considernot only the syntactic structural inormation but also thedynamic call structure simultaneously

8 Conclusion

In this paper we have proposed a novel method or programsource code comparison Te proposed method calculatesthe

similarity between two source codes with the composition o two kinds o structural inormation extractedrom the sourcecodes Tat is the method uses both syntactic inormationand dynamic inormation Te syntactic inormation whichprovides local-level structural view is included in the parsetree In order to compare the parse trees this paper adoptsa specialized tree kernel or parse trees o source codes Tedynamic inormation which is contained in the unction-callgraph gives high and global level structural view Te graphkernel with the consideration unction names is adopted tore1047298ect the graph structure Finally the proposed methoduses a composite kernel o the kernels to use both kindso inormation In addition the weights o the kernels in

the composite kernel are automatically determined with thecyclomatic complexity

In the experiments o Java program source code plagia-rism detection with real data set it is shown that the proposedmethod outperormed existing methods in detecting plagia-rized pairs In particular the experiments with the variousnumber o lines show that the proposed method alwaysworkswell regardless o the size o source codes

One advantage o the proposed method is that it canbe used with other languages such as C C++ and Pythoneven i the experiments were only conducted with JavaSince the proposed method requires only parse trees andunction-call graphs o source codes it can be applied to any

other languages i a parser or the languages is available Allkinds o inormation o the proposed method are available athttpmlknuackrplagiarism

Conflict of Interests

Te authors declare that there is no con1047298ict o interestsregarding the publication o this paper

Acknowledgments

Tis study was supported by the BK983090983089 Plus project (SWHuman Resource Development Program or Supporting

Smart Lie) unded by the Ministry o Education Schoolo Computer Science and Engineering Kyungpook NationalUniversity Korea (983090983089A983090983088983089983091983089983094983088983088983088983088983093) and by IC RampD pro-gram o MSIPIIP (983089983088983088983092983092983092983097983092 WiseKB Big data based sel-evolving knowledge base and reasoning platorm)

References

[983089] J-W Son and S-B Park ldquoWeb table discrimination with com-position o rich structural and content inormationrdquo Applied Sof Computing vol 983089983091 no 983089 pp 983092983095ndash983093983095 983090983088983089983091

[983090] DL McCabe ldquoCheating among college and university studentsa north American perspectiverdquo International Journal or Educa-tional Integrity vol 983089 no 983089 pp 983089ndash983089983089 983090983088983088983093

[983091] S Horwitz ldquoIdentiying the semantic and textual differencesbetween two versions o a programrdquo in Proceedings o the ACM SIGPLAN Conerence on Programming Language Design and Implementation pp 983090983091983092ndash983090983092983093 983089983097983097983088

[983092] W Yang ldquoIdentiying syntactic differences between two pro-

gramsrdquo Sofware Practice and Experience vol 983090983089 no 983095 pp 983095983091983097ndash983095983093983093 983089983097983097983089

[983093] J-W Son -G Noh H-J Song and S-B Park ldquoAn applicationor plagiarized source code detection based on a parse treekernelrdquo Engineering Applications o Arti1047297cial Intelligencevol983090983094no 983096 pp 983089983097983089983089ndash983089983097983089983096 983090983088983089983091

[983094] M L Kammer Plagiarism detection in haskell programs using call graph matching [MS thesis] Utrecht University 983090983088983089983089

[983095] D Haussler ldquoConvolution kernels on discrete structuresrdquo echRep UCS-CRL-983097983097-983089983088 University o Caliornia Santa CruzCali USA 983089983097983097983097

[983096] B Scholkop K suda and J-P Vert Kernel Methods inComputational Biology MI Press 983090983088983088983092

[983097] M Collins and N Duffy ldquoConvolution kernels or natural lan-guagerdquo in Advances in Neural Inormation Processing Systemspp 983094983090983093ndash983094983091983090 983090983088983088983089

[983089983088] Gartner P Flach and S Wrobel ldquoOn graph kernels hardnessresults and efficient alternativesrdquo in Proceedings o the 983089983094th Annual Conerence on Learning Teory pp 983089983090983097ndash983089983092983091 August983090983088983088983091

[983089983089] D Hindle ldquoNoun classi1047297cation rom predicate-argument struc-turesrdquo in Proceedings o the 983090983096th Annual Meeting on Association or Computational Linguistics (ACL rsquo983097983088) pp 983090983094983096ndash983090983095983093 Strouds-burg Pa USA June 983089983097983097983088

[983089983090] P Resnik ldquoUsing inormation content to evaluate semanticsimilarity in a taxonomyrdquo in Proceedings o the983089983091th International Joint Conerence on Arti1047297cial Intelligence pp 983092983092983096ndash983092983093983091 983089983097983097983093

[983089983091] B Gipp N Meuschke and C Breitinger ldquoCitation-based pla-giarism detection practicability on a large-scale scienti1047297c cor-pusrdquo Journalo the Association or Inormation Science and ech-nology vol 983094983093 no 983096 pp 983089983093983090983095ndash983089983093983092983088 983090983088983089983092

[983089983092] G Varelas E Voutsakis P Rafopoulou E G Petrakis andE E Milios ldquoSemantic similarity methods in wordnet andtheir application to inormation retrieval on the webrdquo in Pro-ceedings o the 983095th Annual ACM International Workshop on WebInormation and Data Management pp 983089983088ndash983089983094 983090983088983088983093

[983089983093] K Williams H-H Chen andC L GilesldquoClassiying and rank-ing search engine results as potential sources o plagiarismrdquo inProceedings o the ACM Symposium on Document Engineering pp 983097983095ndash983089983088983094 Fort Collins Colo USA September 983090983088983089983092

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1213

983089983090 Mathematical Problems in Engineering

[983089983094] R A Jarvis and E A Patrick ldquoClustering using a similarity measure based on shared near neighborsrdquo IEEE ransactions onComputers vol 983090983090 no 983089983089 pp 983089983088983090983093ndash983089983088983091983092 983089983097983095983091

[983089983095] K J Ottenstein ldquoAn algorithmic approach to the detection andprevention o plagiarismrdquo ACM SIGCSE Bulletin vol 983096 no 983092pp 983091983088ndash983092983089 983089983097983095983094

[983089983096] M Halstead Elements o Sofware Science Elsevier 983089983097983095983095[983089983097] I D Baxter A Yahin L Moura M SantrsquoAnna and L Bier

ldquoClone detection using abstract syntax treesrdquo in Proceedingso the IEEE International Conerence on Sofware Maintenance(ICSM rsquo983097983096) pp 983091983094983096ndash983091983095983095 November 983089983097983097983096

[983090983088] J Ferrante K J Ottenstein and J D Warren ldquoTe programdependence graph and its use in optimizationrdquo ACM ransac-tions on Programming Languages and Systems vol 983097 no 983091 pp983091983089983097ndash983091983092983097 983089983097983096983095

[983090983089] C Liu C Chen J Han and P S Yu ldquoGplag detection o sofware plagiarism by program dependence graph analysisrdquo inProceedings o the 983089983090th ACM SIGKDD International Conerenceon Knowledge Discovery and Data Mining pp 983096983095983090ndash983096983096983089 983090983088983088983094

[983090983090] H-I Lim H Park S Choi and Han ldquoA method or detecting

the thef o Java programs through analysis o the control 1047298ow inormationrdquo Inormation and Sofware echnology vol 983093983089 no983097 pp 983089983091983091983096ndash983089983091983093983088 983090983088983088983097

[983090983091] D-K Chae J Ha S-W Kim B J Kang and E G ImldquoSofware plagiarism detection a graph-based approachrdquo inProceedings o the 983090983090nd ACM International Conerence onInormation amp Knowledge Management (CIKM rsquo983089983091) pp 983089983093983095983095ndash983089983093983096983088 Burlingame Cali USA November 983090983088983089983091

[983090983092] E Stamatatos ldquoPlagiarism detection using stopword n-gramsrdquo Journal o the American Society or Inormation Science and echnology vol 983094983090 no 983089983090 pp 983090983093983089983090ndash983090983093983090983095 983090983088983089983089

[983090983093] G Cosma and M Joy ldquoAn approach to source-code plagiarismdetection and investigation using latent semantic analysisrdquo

IEEE ransactions on Computers vol 983094983089 no 983091 pp 983091983095983097ndash983091983097983092983090983088983089983090

[983090983094] L Prechelt G Malpohl and M Philippsen ldquoFinding plagia-risms among a set o programs with jplagrdquo Journal o Universal Computer Science vol 983096 no 983089983089 pp 983089983088983089983094ndash983089983088983091983096 983090983088983088983090

[983090983095] A Aiken ldquoMoss a system or detecting sofware plagiarismrdquo983089983097983097983096 httptheorystanordedusimaikenmoss

[983090983096] Kamiya S Kusumoto and K Inoue ldquoCCFinder a multilin-guistic token-based code clone detection system or large scalesource coderdquo IEEE ransactions on Sofware Engineering vol983090983096 no 983095 pp 983094983093983092ndash983094983095983088 983090983088983088983090

[983090983097] J Parr and R W Quong ldquoANLR a predicated-LL(k) parsergeneratorrdquo Sofware Practice and Experience vol 983090983093 no 983095 pp983095983096983097ndash983096983089983088 983089983097983097983093

[983091983088] V N Vapnik Te Nature o StatisticalLearningTeory SpringerNew York NY USA 983089983097983097983093

[983091983089] R Courant and D Hilbert Methods o Mathematical PhysicsInterscience New York NY USA 983089983097983093983091

[983091983090] A Moschitti and F M Zanzotto ldquoFast and effective kernelsor relational learning rom textsrdquo in Proceedings o the 983090983092thInternational Conerence on Machine Learning (ICML rsquo983088983095) pp983094983092983097ndash983094983093983094 Corvallis Ore USA June 983090983088983088983095

[983091983091] M R Garey and D S Johnson Computers and Intractability AGuide to the Teory o NP-Completeness W H Freeman 983089983097983097983088

[983091983092] S V N Vishwanathan N N Schraudolph R Kondor and KM Borgwardt ldquoGraph kernelsrdquo Journal o Machine Learning Research vol 983089983089 pp 983089983090983088983089ndash983089983090983092983090 983090983088983089983088

[983091983093] K M Borgwardt C S Ong S Schonauer S V N Vish-wanathan A J Smola and H-P Kriegel ldquoProtein unction pre-diction via graph kernelsrdquo Bioinormatics vol 983090983089 supplement 983089pp i983092983095ndashi983093983094 983090983088983088983093

[983091983094] N Cristianini and J Shawe-aylor An Introduction to Support Vector Machines Cambridge University Press CambridgeUK983090983088983088983088

[983091983095] J McCabe ldquoA complexity measurerdquo IEEE ransactions onSofware Engineering vol 983090 no 983092 pp 983091983088983096ndash983091983090983088 983089983097983095983094

[983091983096] J Carletta ldquoAssessing agreement on classi1047297cation tasks thekappa statisticrdquo Computational Linguistics vol 983090983090 no 983090 pp983090983092983097ndash983090983093983092 983089983097983097983094

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1313

Submit your manuscripts at

httpwwwhindawicom

Page 6: journal for scribd

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 613

983094 Mathematical Problems in Engineering

where 1and 2

are all the nodes in trees 1 and 2 Teindicator unction subtree1038389

(1103925) is 983089 i subtree1103925 is rooted at node1103925 and 983088 otherwise (11039251 11039252) is a unction which is de1047297ned as

98308011039251 11039252983081 = sum1103925

subtree103838998308011039251983081 sdot subtree1038389

98308011039252983081 (983092)

Tis unction can be calculated in polynomial time using theollowing recursive de1047297nition

(i) I the productions at 11039251 and 11039252 are different

98308011039251 11039252983081 = 0 (983093)

(ii) I both 11039251 and 11039252 are preterminals

98308011039251 11039252983081 = 1 (983094)

(iii) Otherwise the unction can be de1047297ned as ollow

98308011039251

11039252

983081 = nc(9073171)

prod1103925 9830801 + 983080ch1103925

98308011039251

983081 ch1103925

98308011039252

983081983081983081 (983095)

where nc(11039251) is the number o children o node 11039251 inthe tree

Since the productions at 11039251 and 11039252 are the same nc(11039251)is also equal to nc(11039252) Here ch1103925(11039251) denotes the th childnode o 11039251 Tis recursive algorithm is based on the act thatall subtrees rooted at a certain node can be constructed by combining the subtrees rooted at each o its children

983092983091 Modi1047297ed Parse ree Kernel Te parse tree kernel hasshown good perormance or parse trees o natural language

but it does not work well or program source code compari-son due to two issues Te 1047297rst issue is asymmetric in1047298uenceo node changes Te parse tree rom a source code tends tobe much larger and deeper than that rom a natural languagesentence Tereore the changes near root node have beenre1047298ected more ofen than the changes near lea nodes Tesecond issue is the sequence o subtrees Te original parsetree kernel counts the sequence o subtrees by consideringtheir order However the order o two substructures in asource code is meaningless in programming languages

Son et al proposed a modi1047297ed parse tree kernel to copewith these issues [983093] In order to solve the 1047297rst issue they introduced a decay actor tree and a threshold Δ that controlthe effect o large subtrees Te decay actor scales the relativeimportance o subtrees by their size As the depth o a subtreeincreases the kernel value o the subtree is penalized by (tree)size where size is the depth o the subtree In additionthe limitation o the maximum depth o possible subtrees isset as Δ so that the effect o large subtrees could be reducedTe second issue is solved by changing unction in (983095) toignore the order o two nodes

With a decay actor tree and a threshold Δ the recursiverules o the parse tree kernel is modi1047297ed as ollows

(i) I 11039251 and 11039252 are different

98308011039251

11039252

983081 = 0 (983096)

(ii) I both 11039251 and 11039252 are terminals or the current depth isequal to Δ

98308011039251 11039252983081 = tree (983097)

Equation (983095) cannot be used with these new recursive rules

since the number o child nodes can be different in 11039251 and 11039252Tus we adopt the maximum similarity between child nodesAs a result the unction in (983095) becomes

98308011039251 11039252983081 = tree

nc(9073171)prod1103925

8520081 + maxchisinch11039252

983080ch1103925 98308011039251983081 ch983081852009 (983089983088)

where ch9073172is a set o child nodes o 11039252

Te parse tree kernel with the modi1047297ed unctionmptdoes not satisy Mercerrsquos condition However many unctionsthat do not satisy Mercerrsquos condition [983091983088 983091983089] work wellin computing similarity [983091983090] Finally this parse tree kernelis used as the similarity measure sim

( ) in (983089) or syntactic

structural comparison o source codes

5 Similarity Measure for Source CodesBased on Graph Kernel

983093983089 Source Code as a Graph Recently program sourcecodes are written with object-oriented concepts and severalreactoring techniques so that the codes are getting more andmore modularized at unctional level Since a source codeencodes program logic to solve a problem the execution 1047298ow at unction level is one o the important actors to identiy the source code Tereore this unction-level 1047298ow should beconsidered to compare source codes

One possible representation or the unction-level 1047298ow isa unction-call graph which represents dependencies amongunctions within a program Let be a source code Ten a

unction calls graph = ( ) is a directed graph extractedrom where V isin is a unction in Tus || is the numbero unctions in is a set o edges and each edge represents adependency relation between unctions Tat is an edge 1103925 isin that connects nodesV 1103925 and V implies that a unction V 1103925 callsanother unction V Te weight o an edge is given by

1103925 = 9831631 i V 1103925 calls V 0 otherwise (983089983089)

Figure 983091 illustrates an example unction-call graphextracted rom the Java code in Box 983089 Tis code contains 1047297veunctions including main First main calls rFibonacciand iFibonacci in order Since rFibonacci is arecursive unction it calls itsel iFibonacci calls twounctions initOne and sum to initialize a variable and geta sum Finally main calls println to print out the results

A rule-based approach is adopted to extract a call graphrom a source code Simple rules are used to 1047297nd thecaller-callee relationship rom a parse tree For instance inJava a rule ldquoif lsquoexpression (expresionList)rsquo is oundthen expression is a called unction namerdquo is used to1047297nd subtrees rom a parse tree Ten unction names and

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 713

Mathematical Problems in Engineering 983095

rFibonacci

iFibonacci

main

initOne

sum

println

F983145983143983157983154983141 983091 Example o extracted call graph rom Java code

parameters are extracted rom the matched subtrees Tenodes or the extracted unction names are connected to thecaller node

Function-call graph o a program depicts how the pro-gram executes at unction level and how unctions are relatedto one another Since the 1047298ow o a program is quite uniqueaccording to the task the similarity between two sources canbe calculated using the 1047298ows o the programs Since this 1047298ow is represented as a unction-call graph the graph kernel is thebest method to compare unction-call graphs It showed goodperormance in several 1047297elds including biology and social

network analysis

983093983090 Graph Kernel Graph kernel is a kernel thatis designed tocompare graph structures Like the parse tree kernel a graphis mapped onto a eature space spanned by their subgraphs inthe graph kernel Te intuitive eature o the graph kernel isgraph isomorphism that determines the topological identityAccording to Gartner et al [983089983088] however it is as hard asdeciding whether two graphs are isomorphic to compute any complete graph kernel with an injective mapping unctionor all graphs where graph isomorphism is a NP-completeproblem [983091983091] Tus most graph kernels ocus on alternativeeature representation o graphs

Te random walk graph kernel is one o the most widely used graph kernels It uses all possible random walks aseatures or graphs Let be a set o all possible random walksandW907317() denotes the set o all possible walks with 1103925 edgesin graph For each random walk isin whose length is 1103925the corresponding eature mapping unction o a graph isgiven as

Φ () = radic907317graph1048699 isinW907317 () forall 9830801103925983081 = 9830801103925983081983165 (983089983090)

where 907317graph is a weight or the length 1103925 and (1103925) and (1103925)are the

th label o the random walk

and

respectively

Te kernel unction between two graphs 1 and 2 denotedby graph(1 2) can be de1047297ned as

graph 9830801 2983081 = sum

Φ 9830801983081 sdotΦ 9830802983081 (983089983091)

Gartner et al proposed an approach to calculate all ran-domwalks within two graphswithout explicit enumeration o all random walks [983089983088] A direct product graph o two graphs1 = (1 1) and 2 = (2 2) denoted by 1 times 2 =(times times) where times is its node set and times is its edge set isde1047297ned as ollows

times 9830801 times 2983081 = 1048699983080V 1 1983081 isin 1 times 2 | 983080V 1983081 = 9830801983081983165 times 9830801 times 2983081 = 1048699983080983080V 1 1983081 983080V 2 2983081983081 isin times 9830801 times 2983081 |

983080V 1 V 2983081 isin 1 9830801 2983081 isin 2

983080V 1

V 2

983081 = 9830801

2

983081983165

(983089983092)

where (V ) is the label o a node V and ( ) is the label o an edge between node and node Based on the directproduct graph the random walk kernel can be calculated

Let 1038389times isin R|times|times|times| denote an adjacency matrix o the directproduct1 times2 With a weighting actorgraph ge 0graph in(983089983091) can be rewritten as

graph 9830801 2983081 = |times|sum1103925=1

1048667infinsum907317=0

907317graph1038389907317times1048669

1103925

(983089983093)

Tis random walk kernel can be computed in (11039253

) usingSylvester equation or Conjugate Gradient method where 1103925 isthe number o nodes [983091983092]

983093983091 Modi1047297ed Graph Kernel When a graph kernel is used tocompare source codes goodperormance is not expected dueto theact that the graph kernel measures similaritiesbetweenwalks with an identical label Since the labels (unctionnames) o nodes within the unction-call graph are decidedby human developers they are seldom identical even i thesource codes are simple Tereore the graph kernel has toconsider nonidentical labels

Borgwardt et al modi1047297ed the random walk kernel to

compare nonidentical labels by changing the direct productgraph to include all pairs o nodes and edges [983091983093] Assumethat nodes are compared by a node kernelnode and edges arecompared by an edge kernel kedge edge Tat is node(V )calculates the similarity between two labels rom the nodes V

and andedge((V 1103925 V 1103925+1)(1103925 1103925+1)) computes the similarity between two edges (V 1103925 V 1103925+1) and (1103925 1103925+1) With these twokernels the random walk kernel between two unction callgraphs 1 and 2 is now de1047297ned as

mg 9830801 2983081 = 907317minus1sum1103925=1

step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081 (983089983094)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 813

983096 Mathematical Problems in Engineering

where

step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081= node 983080V 1103925 1103925983081 sdot node 983080V 1103925+1 1103925+1983081

sdot edge

983080983080V 1103925

V 1103925+1

983081 9830801103925

1103925+1

983081983081

(983089983095)

I this modi1047297ed random walk kernel is used or thecomparison o source code the node kernel node and theedge kernel edge should be de1047297ned Note that the labelso edges in unction-call graph are binary values by (983089983089)Tus edge is simply designed to compare binary valuesTe simplest orm or node(V ) is a unction that returns983089 when V and have similar string patterns 983088 otherwiseTat is it returns 983089 i a distance between V and is smallerthan a prede1047297ned threshold In this paper we simply useLevenshtein distance as the distance and set the threshold as983088983093

Te modi1047297ed random walk kernel mg can be alsocomputed using (983089983093) However the adjacency matrix

1038389times o

the direct product 1 times 2 should be modi1047297ed as

9831311038389times983133(V 10383891038389)(V 907317907317)

= 852091step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081

i 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081 isin times0 otherwise(983089983096)

wheretimes isashortormo times(1times2) and the edges (V 1103925 V 1103925+1)and (1103925 1103925+1) belong to 1 and 2 respectively As in parsetree kernel this modi1047297ed graph kernel is used to comparesource codes as similarity measure sim( ) in (983089)

6 Similarity Measure for Source CodesBased on a Composite Kernel

Te modi1047297ed parse tree kernel manages syntactic structuralinormation whereas the modi1047297ed graph kernel considershigh-level topological inormation o source codes In orderto make use o both kinds o inormation the compositiono the two kernels is required Cristianini and Shawe-aylorproved that a new kernel can be obtained by combiningexisting several kernels with some closure properties suchas weighted sum and multiplication [983091983094] Among variousclosure properties this paper adopts the weighted sum sinceit is simple and widely used

Beore combining two kernels the kernels should benormalized since the modi1047297ed parse tree kernel mpt andthe modi1047297ed graph kernelmg are not bound Tereore onekernel can dominate other in their composition In order toremove this effect the kernels are 1047297rst normalized When akernel() is given itsnormalized one()isde1047297nedas

= radic ( ) sdot 983080 983081 (983089983097)

Tereore () is bounded between 983088 and 983089 Tat is 0 le

(

) le 1

Our composite kernelco is composed o the normalizedmodi1047297ed parse tree kernel

mpt and the normalized modi1047297ed

graph kernelmg Tatis the composite kernel

co or given

two source codes and is de1047297ned as

co

= 9830801 minus 983081 sdot mpt

983080

983081 + sdot mg

983080

983081 (983090983088)

where is a mixing weight between two kernels and areparse trees extracted rom source codes and respectivelyand and are call graphs rom and respectively Telarger is the more signi1047297cant mpt is On the other handas the value o gets small the graph kernel mg is moresigni1047297cant than the parse tree kernel mpt Tis compositekernel is used as our 1047297nal similarity measure sim( ) in (983089)

Te parse tree kernel compares source codes with local-level view since it is based on subtree comparison Most pla-giarized source codes change a small portion o the originalsource code Tus the parse tree kernel has shown goodperormances in general However it does not re1047298ect the 1047298ow o the program which is dynamic structural inormationTegraph kernel on the other hand calculates the similarity interms o dynamic high-level view Tus when source codesconsist o a number o unctions the graph kernel achievesreasonable perormance As a result should be determinedby the complexity o source codes since it is a parameter tocontrol the relative importance between the parse tree kerneland the graph kernel

Tere are many methodsthat measure the complexity o asource code One widely-used method is the cyclomatic com-plexity proposed by McCabe [983091983095] Te cyclomatic complexity is a graph-theoretic quantitative metric and measures thenumber o paths within a source code It is simply calculated

using a control 1047298ow graph o a source code where the nodeso the graph correspond to entities o the source code andan (directed) edge between two nodes implies a dependency relation between entities Given the control 1047298ow graph o asource code () the cyclomatic complexity o source code is de1047297ned as

() = minus + 2 (983090983089)

where is the number o edges o the graph is the numbero nodes and is the number o connected components Telarger is the more complicated source code is

In this paper we measure the complexity o a sourcecode using itsunction-call graph Since a unction-call graphrepresents dependencies among unctions within a programit can be considered as a kind o control 1047298ow graphs whereentities o the source code are the unction in the source codeand the edge implies dependencies between unctions

Let() and() be the cyclomatic complexities o twosource codes and respectively Since is the weight o two normalized kernels it has to normalize between 983088 and 983089Te sigmoid unction is de1047297ned or all real input values andreturns a positive value between 983088 and 983089 Tus the sigmoidunction is adopted or o (983090983088) and is de1047297ned as

= 1

1 + minus(min(()())minus25)

(983090983090)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 913

Mathematical Problems in Engineering 983097

983137983138983148983141 983089 Simple statistics on the real data set

Inormation Value

Number o total assignments 983091983094

Number o submitted source codes 983093983093983093

Average number o submitted codes per assignment 983089983093983092983090

Minimum number o lines in source code 983092983097Maximum number o lines in source code 983090983096983094983091

Average number o lines per source code 983091983088983093983088983095

Minimum number o nodes in source code 983089983090

Maximum number o nodes in source code 983092983092983095

Average number o nodes in source code 983094983092983090983097

Number o marked plagiarism pairs 983089983095983093

where min() returns the minimum value between and According to (983090983090) as the cyclomatic complexity gets larger also increases is to be 983088983093 when thecyclomatic complexity o source code is 983090983093 Tis indicates

that when the cyclomatic complexity o source code is983090983093 the parse tree kernel and the graph kernel have anequal importance in the composite kernel A number o source code analysis applications regard source codes whosecyclomatic complexity is more than 983090983093 as complicated codes(httpmsdnmicrosofcomen-uslibraryms983089983096983090983090983089983090aspx )Tus we set 983090983093 as the equal point o the importance betweenthe parse tree kernel and the graph kernel

7 Experiments

983095983089 Experimental Settings For experiments the same dataset in the work o Son et al [983093] is used Tis data setis collected rom actual programming assignments o Javaclasses submitted by undergraduate students rom 983090983088983088983093 to983090983088983088983097 able 983089 shows simple statistics o the data set Te totalnumber o programming assignments is 983091983094 and the numbero submitted source codes is 983093983093983093 or the 983091983094 assignments Tusthe average number o source codes per an assignment is983089983093983092983090

Figure 983092 shows the histogram o the source codes perlines Te -axis is the number o program lines and the -axis represents the number o source codes As shown in this1047297gure about 983095983093 o source codes are written with less than983092983088983088 lines Te minimum number o lines o a source code is983092983097 and the maximum is 983090983096983094983091 Te average number o lines

per code is 983091983088983093983088983095In our data set the minimum number o unctions within

a program is 983089983090 whereas the maximum number is 983092983092983095Te programs with larger number o programs are paintprograms with any buttons In the paint programs studentsare required to set a layout manually with raw unctionssuch assetBounds Tus paint programs have a number o unctions Te average number o unctions is 983094983092983090983095

wo annotators created the gold standard or this dataset Tey investigated all sourcecodes and markedplagiarizedpairs manually In order to measurethe reliability andvalidity o the annotators Cohenrsquos kappa agreement [983091983096] is measuredTe kappa agreement o the annotators is

= 093 whichalls

180

160

140

120

100

80

60

40

20

00 500 1000 1500 2000 2500 3000

Source code lines

N u m

b e r o f s

o u r c e c o

d e s

F983145983143983157983154983141 983092 Histogram o source lines per code

on the category o ldquoalmost perect agreementrdquo Only the pairs judged as plagiarized pairs by both annotators are regardedas actual plagiarized pairs In total 983089983095983093 pairs are marked asplagiarized pairs

Tree metrics are used as evaluation measure precisionrecall and 1-measure Tey are calculated as ollows

Precision = o correctly detected plagiarized pairs

o detected plagiarized pairs

Recall = o correctly detected plagiarized pairs

o true plagiarized pairs

1-measure = 2 sdot precision

sdot recall

precision + recall (983090983091)

In order to evaluate the proposed method severalbaseline systems are used It is compared with JPlag andCCFinder In all experiments or the parse tree kernel thethreshold or subtree depth Δ is set as 983091 and the decay actortree is 983088983089 Te decay actor graph or graph kernel is set tobe 983088983089 empirically in (983090983089) is set to be 983089 because each sourcecode in our data set is a single program

983095983090 Experimental Results Beore evaluating the perormanceo plagiarism detection we 1047297rst examine relatedness between

the number o source code lines and the cyclomatic complex-ity Tis examination tries to show that (983090983090) is easible Since is determined with cyclomatic complexity it is expected to beproportionalto cyclomatic complexity Figure 983093 shows scatterplot between the number o lines and cyclomatic complexityAs shown in this 1047297gure they are highly correlated with eachother in our data set Te Pearson correlation coefficient is983088983095983089983092 Tis result implies that it is easible to set in (983090983090) to beproportional to cyclomatic complexity

In order to see the effect o threshold in (983089) in ourmethod the perormances are measured according to the

values o Figure 983094 shows the perormance o the pro-posed method or various

As

increases the precision

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1013

983089983088 Mathematical Problems in Engineering

600

500

400

300

200

100

0

C y c

l o m a t i c

c o m p l e x i t y

0 500 1000 1500 2000 2500 3000

Number of lines

F983145983143983157983154983141 983093 Scatter plot according to the number o lines o sourcecodes and the corresponding cyclomatic compelxity

10

09

08

07

06

05

086 088 092090 094 096 098

RecallPrecision

F -measure

Te value of threshold

F983145983143983157983154983141 983094 Perormance o the proposed system or real-world dataset

also increases while the recall decreases slightly Te bestperormance is achieved at = 096 with 983088983096983095 o 1-measureTus

= 096 is used at all the experiments below

Figure 983095 compares the proposed method with variouskernels according to the number o source code lines In this1047297gure the -axis is the number lines o source code andthe -axis represent the average 1-measure As shown inthis 1047297gure the original graph kernel shows the worst per-ormance Since it uses only the graph structure o unctioncalls it ofen ails in calculating the similarity among sourcecodes For example assume that there are two source codesIn one source code main calls a unction add and addcalls another unction multiply In the other source code main calls multiply and multiply calls add Tese twosource codes are the same under the graph kernel since thelabel inormation is ignored by the graph kernel Without

Graph kernel

Parse tree kernel

Modi1047297ed graph kernel

Proposed method

10

08

06

04

02

00

A v e r a g e F 1

- m e a s u r e

lt100 lt200 lt300 lt400 gt400

Number of lines

F983145983143983157983154983141 983095 Average 1-measure according to the number o source

code lines

the labels these two graphs are identical On the other handthe modi1047297ed graph kernel utilizes the label inormation As aresult it achieves better perormance than the graph kernel

Te parse tree kernel achieves higher perormance thanother methods or the source codes with less than 983091983088983088 linesWhen the number o lines in source codes is small theplagiarized codes are ofen made by changing the originalone locally Tus the parse tree kernel detects plagiarizedpairs accurately or the codes with small number o linesWhen source codes have more than 983091983088983088 lines the modi1047297edgraph kernel shows slightly better perormance than theparse tree kernel Tis result implies that high-level structuralinormation is another actor to compare (large) source codesand the modi1047297ed graph kernel can re1047298ect this structuralinormation well

Te proposed method that combines the parse tree kerneland the modi1047297ed graph kernel achieves the best perormanceor all source codes except those with 300 sim 400 linesSince the cyclomatic complexity o source codes with 300 sim400 lines is near 983090983093 the proposed method re1047298ects the parsetree kernel and the modi1047297ed graph kernel equally Tusit achieves an average perormance o the kernels By thecyclomatic complexity o source codes the proposed method

is more in1047298uenced by the parse tree kernel when a sourcecode is small I a source code is large the effect o graphkernel is larger than that o the parse tree kernel From theresults it can be concluded that the proposed method doesnot consider only local-level structural inormation but alsohigh-level structural inormation effectively

Te 1047297nal 1-measure o program source code plagiarismdetection is given in able 983090 Te proposed method showsthe best 1-measure compared to other kernels or opensource plagiarism systems Te difference o 1-measure is983088983090983097 against JPlag 983088983089983095 against CCFinder 983088983088983096 against themodi1047297ed graph and 983088983088983093 against the modi1047297ed parse treekernel Tis result implies that or source code plagiarism

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1113

Mathematical Problems in Engineering 983089983089

983137983138983148983141 983090 Final 1-measure o plagiarism detection

Method 1-measure

JPlag 983088983093983094

CCFinder 983088983095983088

Modi1047297ed parse tree kernel ( = 095) [983093] 983088983096983092

Graph kernel ( = 099) 983088983092983093Modi1047297ed graph kernel ( = 097) 983088983096983090

Proposed method ( = 096) 983088983096983095

detection the similarity measure sim in (983089) should considernot only the syntactic structural inormation but also thedynamic call structure simultaneously

8 Conclusion

In this paper we have proposed a novel method or programsource code comparison Te proposed method calculatesthe

similarity between two source codes with the composition o two kinds o structural inormation extractedrom the sourcecodes Tat is the method uses both syntactic inormationand dynamic inormation Te syntactic inormation whichprovides local-level structural view is included in the parsetree In order to compare the parse trees this paper adoptsa specialized tree kernel or parse trees o source codes Tedynamic inormation which is contained in the unction-callgraph gives high and global level structural view Te graphkernel with the consideration unction names is adopted tore1047298ect the graph structure Finally the proposed methoduses a composite kernel o the kernels to use both kindso inormation In addition the weights o the kernels in

the composite kernel are automatically determined with thecyclomatic complexity

In the experiments o Java program source code plagia-rism detection with real data set it is shown that the proposedmethod outperormed existing methods in detecting plagia-rized pairs In particular the experiments with the variousnumber o lines show that the proposed method alwaysworkswell regardless o the size o source codes

One advantage o the proposed method is that it canbe used with other languages such as C C++ and Pythoneven i the experiments were only conducted with JavaSince the proposed method requires only parse trees andunction-call graphs o source codes it can be applied to any

other languages i a parser or the languages is available Allkinds o inormation o the proposed method are available athttpmlknuackrplagiarism

Conflict of Interests

Te authors declare that there is no con1047298ict o interestsregarding the publication o this paper

Acknowledgments

Tis study was supported by the BK983090983089 Plus project (SWHuman Resource Development Program or Supporting

Smart Lie) unded by the Ministry o Education Schoolo Computer Science and Engineering Kyungpook NationalUniversity Korea (983090983089A983090983088983089983091983089983094983088983088983088983088983093) and by IC RampD pro-gram o MSIPIIP (983089983088983088983092983092983092983097983092 WiseKB Big data based sel-evolving knowledge base and reasoning platorm)

References

[983089] J-W Son and S-B Park ldquoWeb table discrimination with com-position o rich structural and content inormationrdquo Applied Sof Computing vol 983089983091 no 983089 pp 983092983095ndash983093983095 983090983088983089983091

[983090] DL McCabe ldquoCheating among college and university studentsa north American perspectiverdquo International Journal or Educa-tional Integrity vol 983089 no 983089 pp 983089ndash983089983089 983090983088983088983093

[983091] S Horwitz ldquoIdentiying the semantic and textual differencesbetween two versions o a programrdquo in Proceedings o the ACM SIGPLAN Conerence on Programming Language Design and Implementation pp 983090983091983092ndash983090983092983093 983089983097983097983088

[983092] W Yang ldquoIdentiying syntactic differences between two pro-

gramsrdquo Sofware Practice and Experience vol 983090983089 no 983095 pp 983095983091983097ndash983095983093983093 983089983097983097983089

[983093] J-W Son -G Noh H-J Song and S-B Park ldquoAn applicationor plagiarized source code detection based on a parse treekernelrdquo Engineering Applications o Arti1047297cial Intelligencevol983090983094no 983096 pp 983089983097983089983089ndash983089983097983089983096 983090983088983089983091

[983094] M L Kammer Plagiarism detection in haskell programs using call graph matching [MS thesis] Utrecht University 983090983088983089983089

[983095] D Haussler ldquoConvolution kernels on discrete structuresrdquo echRep UCS-CRL-983097983097-983089983088 University o Caliornia Santa CruzCali USA 983089983097983097983097

[983096] B Scholkop K suda and J-P Vert Kernel Methods inComputational Biology MI Press 983090983088983088983092

[983097] M Collins and N Duffy ldquoConvolution kernels or natural lan-guagerdquo in Advances in Neural Inormation Processing Systemspp 983094983090983093ndash983094983091983090 983090983088983088983089

[983089983088] Gartner P Flach and S Wrobel ldquoOn graph kernels hardnessresults and efficient alternativesrdquo in Proceedings o the 983089983094th Annual Conerence on Learning Teory pp 983089983090983097ndash983089983092983091 August983090983088983088983091

[983089983089] D Hindle ldquoNoun classi1047297cation rom predicate-argument struc-turesrdquo in Proceedings o the 983090983096th Annual Meeting on Association or Computational Linguistics (ACL rsquo983097983088) pp 983090983094983096ndash983090983095983093 Strouds-burg Pa USA June 983089983097983097983088

[983089983090] P Resnik ldquoUsing inormation content to evaluate semanticsimilarity in a taxonomyrdquo in Proceedings o the983089983091th International Joint Conerence on Arti1047297cial Intelligence pp 983092983092983096ndash983092983093983091 983089983097983097983093

[983089983091] B Gipp N Meuschke and C Breitinger ldquoCitation-based pla-giarism detection practicability on a large-scale scienti1047297c cor-pusrdquo Journalo the Association or Inormation Science and ech-nology vol 983094983093 no 983096 pp 983089983093983090983095ndash983089983093983092983088 983090983088983089983092

[983089983092] G Varelas E Voutsakis P Rafopoulou E G Petrakis andE E Milios ldquoSemantic similarity methods in wordnet andtheir application to inormation retrieval on the webrdquo in Pro-ceedings o the 983095th Annual ACM International Workshop on WebInormation and Data Management pp 983089983088ndash983089983094 983090983088983088983093

[983089983093] K Williams H-H Chen andC L GilesldquoClassiying and rank-ing search engine results as potential sources o plagiarismrdquo inProceedings o the ACM Symposium on Document Engineering pp 983097983095ndash983089983088983094 Fort Collins Colo USA September 983090983088983089983092

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1213

983089983090 Mathematical Problems in Engineering

[983089983094] R A Jarvis and E A Patrick ldquoClustering using a similarity measure based on shared near neighborsrdquo IEEE ransactions onComputers vol 983090983090 no 983089983089 pp 983089983088983090983093ndash983089983088983091983092 983089983097983095983091

[983089983095] K J Ottenstein ldquoAn algorithmic approach to the detection andprevention o plagiarismrdquo ACM SIGCSE Bulletin vol 983096 no 983092pp 983091983088ndash983092983089 983089983097983095983094

[983089983096] M Halstead Elements o Sofware Science Elsevier 983089983097983095983095[983089983097] I D Baxter A Yahin L Moura M SantrsquoAnna and L Bier

ldquoClone detection using abstract syntax treesrdquo in Proceedingso the IEEE International Conerence on Sofware Maintenance(ICSM rsquo983097983096) pp 983091983094983096ndash983091983095983095 November 983089983097983097983096

[983090983088] J Ferrante K J Ottenstein and J D Warren ldquoTe programdependence graph and its use in optimizationrdquo ACM ransac-tions on Programming Languages and Systems vol 983097 no 983091 pp983091983089983097ndash983091983092983097 983089983097983096983095

[983090983089] C Liu C Chen J Han and P S Yu ldquoGplag detection o sofware plagiarism by program dependence graph analysisrdquo inProceedings o the 983089983090th ACM SIGKDD International Conerenceon Knowledge Discovery and Data Mining pp 983096983095983090ndash983096983096983089 983090983088983088983094

[983090983090] H-I Lim H Park S Choi and Han ldquoA method or detecting

the thef o Java programs through analysis o the control 1047298ow inormationrdquo Inormation and Sofware echnology vol 983093983089 no983097 pp 983089983091983091983096ndash983089983091983093983088 983090983088983088983097

[983090983091] D-K Chae J Ha S-W Kim B J Kang and E G ImldquoSofware plagiarism detection a graph-based approachrdquo inProceedings o the 983090983090nd ACM International Conerence onInormation amp Knowledge Management (CIKM rsquo983089983091) pp 983089983093983095983095ndash983089983093983096983088 Burlingame Cali USA November 983090983088983089983091

[983090983092] E Stamatatos ldquoPlagiarism detection using stopword n-gramsrdquo Journal o the American Society or Inormation Science and echnology vol 983094983090 no 983089983090 pp 983090983093983089983090ndash983090983093983090983095 983090983088983089983089

[983090983093] G Cosma and M Joy ldquoAn approach to source-code plagiarismdetection and investigation using latent semantic analysisrdquo

IEEE ransactions on Computers vol 983094983089 no 983091 pp 983091983095983097ndash983091983097983092983090983088983089983090

[983090983094] L Prechelt G Malpohl and M Philippsen ldquoFinding plagia-risms among a set o programs with jplagrdquo Journal o Universal Computer Science vol 983096 no 983089983089 pp 983089983088983089983094ndash983089983088983091983096 983090983088983088983090

[983090983095] A Aiken ldquoMoss a system or detecting sofware plagiarismrdquo983089983097983097983096 httptheorystanordedusimaikenmoss

[983090983096] Kamiya S Kusumoto and K Inoue ldquoCCFinder a multilin-guistic token-based code clone detection system or large scalesource coderdquo IEEE ransactions on Sofware Engineering vol983090983096 no 983095 pp 983094983093983092ndash983094983095983088 983090983088983088983090

[983090983097] J Parr and R W Quong ldquoANLR a predicated-LL(k) parsergeneratorrdquo Sofware Practice and Experience vol 983090983093 no 983095 pp983095983096983097ndash983096983089983088 983089983097983097983093

[983091983088] V N Vapnik Te Nature o StatisticalLearningTeory SpringerNew York NY USA 983089983097983097983093

[983091983089] R Courant and D Hilbert Methods o Mathematical PhysicsInterscience New York NY USA 983089983097983093983091

[983091983090] A Moschitti and F M Zanzotto ldquoFast and effective kernelsor relational learning rom textsrdquo in Proceedings o the 983090983092thInternational Conerence on Machine Learning (ICML rsquo983088983095) pp983094983092983097ndash983094983093983094 Corvallis Ore USA June 983090983088983088983095

[983091983091] M R Garey and D S Johnson Computers and Intractability AGuide to the Teory o NP-Completeness W H Freeman 983089983097983097983088

[983091983092] S V N Vishwanathan N N Schraudolph R Kondor and KM Borgwardt ldquoGraph kernelsrdquo Journal o Machine Learning Research vol 983089983089 pp 983089983090983088983089ndash983089983090983092983090 983090983088983089983088

[983091983093] K M Borgwardt C S Ong S Schonauer S V N Vish-wanathan A J Smola and H-P Kriegel ldquoProtein unction pre-diction via graph kernelsrdquo Bioinormatics vol 983090983089 supplement 983089pp i983092983095ndashi983093983094 983090983088983088983093

[983091983094] N Cristianini and J Shawe-aylor An Introduction to Support Vector Machines Cambridge University Press CambridgeUK983090983088983088983088

[983091983095] J McCabe ldquoA complexity measurerdquo IEEE ransactions onSofware Engineering vol 983090 no 983092 pp 983091983088983096ndash983091983090983088 983089983097983095983094

[983091983096] J Carletta ldquoAssessing agreement on classi1047297cation tasks thekappa statisticrdquo Computational Linguistics vol 983090983090 no 983090 pp983090983092983097ndash983090983093983092 983089983097983097983094

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1313

Submit your manuscripts at

httpwwwhindawicom

Page 7: journal for scribd

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 713

Mathematical Problems in Engineering 983095

rFibonacci

iFibonacci

main

initOne

sum

println

F983145983143983157983154983141 983091 Example o extracted call graph rom Java code

parameters are extracted rom the matched subtrees Tenodes or the extracted unction names are connected to thecaller node

Function-call graph o a program depicts how the pro-gram executes at unction level and how unctions are relatedto one another Since the 1047298ow o a program is quite uniqueaccording to the task the similarity between two sources canbe calculated using the 1047298ows o the programs Since this 1047298ow is represented as a unction-call graph the graph kernel is thebest method to compare unction-call graphs It showed goodperormance in several 1047297elds including biology and social

network analysis

983093983090 Graph Kernel Graph kernel is a kernel thatis designed tocompare graph structures Like the parse tree kernel a graphis mapped onto a eature space spanned by their subgraphs inthe graph kernel Te intuitive eature o the graph kernel isgraph isomorphism that determines the topological identityAccording to Gartner et al [983089983088] however it is as hard asdeciding whether two graphs are isomorphic to compute any complete graph kernel with an injective mapping unctionor all graphs where graph isomorphism is a NP-completeproblem [983091983091] Tus most graph kernels ocus on alternativeeature representation o graphs

Te random walk graph kernel is one o the most widely used graph kernels It uses all possible random walks aseatures or graphs Let be a set o all possible random walksandW907317() denotes the set o all possible walks with 1103925 edgesin graph For each random walk isin whose length is 1103925the corresponding eature mapping unction o a graph isgiven as

Φ () = radic907317graph1048699 isinW907317 () forall 9830801103925983081 = 9830801103925983081983165 (983089983090)

where 907317graph is a weight or the length 1103925 and (1103925) and (1103925)are the

th label o the random walk

and

respectively

Te kernel unction between two graphs 1 and 2 denotedby graph(1 2) can be de1047297ned as

graph 9830801 2983081 = sum

Φ 9830801983081 sdotΦ 9830802983081 (983089983091)

Gartner et al proposed an approach to calculate all ran-domwalks within two graphswithout explicit enumeration o all random walks [983089983088] A direct product graph o two graphs1 = (1 1) and 2 = (2 2) denoted by 1 times 2 =(times times) where times is its node set and times is its edge set isde1047297ned as ollows

times 9830801 times 2983081 = 1048699983080V 1 1983081 isin 1 times 2 | 983080V 1983081 = 9830801983081983165 times 9830801 times 2983081 = 1048699983080983080V 1 1983081 983080V 2 2983081983081 isin times 9830801 times 2983081 |

983080V 1 V 2983081 isin 1 9830801 2983081 isin 2

983080V 1

V 2

983081 = 9830801

2

983081983165

(983089983092)

where (V ) is the label o a node V and ( ) is the label o an edge between node and node Based on the directproduct graph the random walk kernel can be calculated

Let 1038389times isin R|times|times|times| denote an adjacency matrix o the directproduct1 times2 With a weighting actorgraph ge 0graph in(983089983091) can be rewritten as

graph 9830801 2983081 = |times|sum1103925=1

1048667infinsum907317=0

907317graph1038389907317times1048669

1103925

(983089983093)

Tis random walk kernel can be computed in (11039253

) usingSylvester equation or Conjugate Gradient method where 1103925 isthe number o nodes [983091983092]

983093983091 Modi1047297ed Graph Kernel When a graph kernel is used tocompare source codes goodperormance is not expected dueto theact that the graph kernel measures similaritiesbetweenwalks with an identical label Since the labels (unctionnames) o nodes within the unction-call graph are decidedby human developers they are seldom identical even i thesource codes are simple Tereore the graph kernel has toconsider nonidentical labels

Borgwardt et al modi1047297ed the random walk kernel to

compare nonidentical labels by changing the direct productgraph to include all pairs o nodes and edges [983091983093] Assumethat nodes are compared by a node kernelnode and edges arecompared by an edge kernel kedge edge Tat is node(V )calculates the similarity between two labels rom the nodes V

and andedge((V 1103925 V 1103925+1)(1103925 1103925+1)) computes the similarity between two edges (V 1103925 V 1103925+1) and (1103925 1103925+1) With these twokernels the random walk kernel between two unction callgraphs 1 and 2 is now de1047297ned as

mg 9830801 2983081 = 907317minus1sum1103925=1

step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081 (983089983094)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 813

983096 Mathematical Problems in Engineering

where

step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081= node 983080V 1103925 1103925983081 sdot node 983080V 1103925+1 1103925+1983081

sdot edge

983080983080V 1103925

V 1103925+1

983081 9830801103925

1103925+1

983081983081

(983089983095)

I this modi1047297ed random walk kernel is used or thecomparison o source code the node kernel node and theedge kernel edge should be de1047297ned Note that the labelso edges in unction-call graph are binary values by (983089983089)Tus edge is simply designed to compare binary valuesTe simplest orm or node(V ) is a unction that returns983089 when V and have similar string patterns 983088 otherwiseTat is it returns 983089 i a distance between V and is smallerthan a prede1047297ned threshold In this paper we simply useLevenshtein distance as the distance and set the threshold as983088983093

Te modi1047297ed random walk kernel mg can be alsocomputed using (983089983093) However the adjacency matrix

1038389times o

the direct product 1 times 2 should be modi1047297ed as

9831311038389times983133(V 10383891038389)(V 907317907317)

= 852091step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081

i 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081 isin times0 otherwise(983089983096)

wheretimes isashortormo times(1times2) and the edges (V 1103925 V 1103925+1)and (1103925 1103925+1) belong to 1 and 2 respectively As in parsetree kernel this modi1047297ed graph kernel is used to comparesource codes as similarity measure sim( ) in (983089)

6 Similarity Measure for Source CodesBased on a Composite Kernel

Te modi1047297ed parse tree kernel manages syntactic structuralinormation whereas the modi1047297ed graph kernel considershigh-level topological inormation o source codes In orderto make use o both kinds o inormation the compositiono the two kernels is required Cristianini and Shawe-aylorproved that a new kernel can be obtained by combiningexisting several kernels with some closure properties suchas weighted sum and multiplication [983091983094] Among variousclosure properties this paper adopts the weighted sum sinceit is simple and widely used

Beore combining two kernels the kernels should benormalized since the modi1047297ed parse tree kernel mpt andthe modi1047297ed graph kernelmg are not bound Tereore onekernel can dominate other in their composition In order toremove this effect the kernels are 1047297rst normalized When akernel() is given itsnormalized one()isde1047297nedas

= radic ( ) sdot 983080 983081 (983089983097)

Tereore () is bounded between 983088 and 983089 Tat is 0 le

(

) le 1

Our composite kernelco is composed o the normalizedmodi1047297ed parse tree kernel

mpt and the normalized modi1047297ed

graph kernelmg Tatis the composite kernel

co or given

two source codes and is de1047297ned as

co

= 9830801 minus 983081 sdot mpt

983080

983081 + sdot mg

983080

983081 (983090983088)

where is a mixing weight between two kernels and areparse trees extracted rom source codes and respectivelyand and are call graphs rom and respectively Telarger is the more signi1047297cant mpt is On the other handas the value o gets small the graph kernel mg is moresigni1047297cant than the parse tree kernel mpt Tis compositekernel is used as our 1047297nal similarity measure sim( ) in (983089)

Te parse tree kernel compares source codes with local-level view since it is based on subtree comparison Most pla-giarized source codes change a small portion o the originalsource code Tus the parse tree kernel has shown goodperormances in general However it does not re1047298ect the 1047298ow o the program which is dynamic structural inormationTegraph kernel on the other hand calculates the similarity interms o dynamic high-level view Tus when source codesconsist o a number o unctions the graph kernel achievesreasonable perormance As a result should be determinedby the complexity o source codes since it is a parameter tocontrol the relative importance between the parse tree kerneland the graph kernel

Tere are many methodsthat measure the complexity o asource code One widely-used method is the cyclomatic com-plexity proposed by McCabe [983091983095] Te cyclomatic complexity is a graph-theoretic quantitative metric and measures thenumber o paths within a source code It is simply calculated

using a control 1047298ow graph o a source code where the nodeso the graph correspond to entities o the source code andan (directed) edge between two nodes implies a dependency relation between entities Given the control 1047298ow graph o asource code () the cyclomatic complexity o source code is de1047297ned as

() = minus + 2 (983090983089)

where is the number o edges o the graph is the numbero nodes and is the number o connected components Telarger is the more complicated source code is

In this paper we measure the complexity o a sourcecode using itsunction-call graph Since a unction-call graphrepresents dependencies among unctions within a programit can be considered as a kind o control 1047298ow graphs whereentities o the source code are the unction in the source codeand the edge implies dependencies between unctions

Let() and() be the cyclomatic complexities o twosource codes and respectively Since is the weight o two normalized kernels it has to normalize between 983088 and 983089Te sigmoid unction is de1047297ned or all real input values andreturns a positive value between 983088 and 983089 Tus the sigmoidunction is adopted or o (983090983088) and is de1047297ned as

= 1

1 + minus(min(()())minus25)

(983090983090)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 913

Mathematical Problems in Engineering 983097

983137983138983148983141 983089 Simple statistics on the real data set

Inormation Value

Number o total assignments 983091983094

Number o submitted source codes 983093983093983093

Average number o submitted codes per assignment 983089983093983092983090

Minimum number o lines in source code 983092983097Maximum number o lines in source code 983090983096983094983091

Average number o lines per source code 983091983088983093983088983095

Minimum number o nodes in source code 983089983090

Maximum number o nodes in source code 983092983092983095

Average number o nodes in source code 983094983092983090983097

Number o marked plagiarism pairs 983089983095983093

where min() returns the minimum value between and According to (983090983090) as the cyclomatic complexity gets larger also increases is to be 983088983093 when thecyclomatic complexity o source code is 983090983093 Tis indicates

that when the cyclomatic complexity o source code is983090983093 the parse tree kernel and the graph kernel have anequal importance in the composite kernel A number o source code analysis applications regard source codes whosecyclomatic complexity is more than 983090983093 as complicated codes(httpmsdnmicrosofcomen-uslibraryms983089983096983090983090983089983090aspx )Tus we set 983090983093 as the equal point o the importance betweenthe parse tree kernel and the graph kernel

7 Experiments

983095983089 Experimental Settings For experiments the same dataset in the work o Son et al [983093] is used Tis data setis collected rom actual programming assignments o Javaclasses submitted by undergraduate students rom 983090983088983088983093 to983090983088983088983097 able 983089 shows simple statistics o the data set Te totalnumber o programming assignments is 983091983094 and the numbero submitted source codes is 983093983093983093 or the 983091983094 assignments Tusthe average number o source codes per an assignment is983089983093983092983090

Figure 983092 shows the histogram o the source codes perlines Te -axis is the number o program lines and the -axis represents the number o source codes As shown in this1047297gure about 983095983093 o source codes are written with less than983092983088983088 lines Te minimum number o lines o a source code is983092983097 and the maximum is 983090983096983094983091 Te average number o lines

per code is 983091983088983093983088983095In our data set the minimum number o unctions within

a program is 983089983090 whereas the maximum number is 983092983092983095Te programs with larger number o programs are paintprograms with any buttons In the paint programs studentsare required to set a layout manually with raw unctionssuch assetBounds Tus paint programs have a number o unctions Te average number o unctions is 983094983092983090983095

wo annotators created the gold standard or this dataset Tey investigated all sourcecodes and markedplagiarizedpairs manually In order to measurethe reliability andvalidity o the annotators Cohenrsquos kappa agreement [983091983096] is measuredTe kappa agreement o the annotators is

= 093 whichalls

180

160

140

120

100

80

60

40

20

00 500 1000 1500 2000 2500 3000

Source code lines

N u m

b e r o f s

o u r c e c o

d e s

F983145983143983157983154983141 983092 Histogram o source lines per code

on the category o ldquoalmost perect agreementrdquo Only the pairs judged as plagiarized pairs by both annotators are regardedas actual plagiarized pairs In total 983089983095983093 pairs are marked asplagiarized pairs

Tree metrics are used as evaluation measure precisionrecall and 1-measure Tey are calculated as ollows

Precision = o correctly detected plagiarized pairs

o detected plagiarized pairs

Recall = o correctly detected plagiarized pairs

o true plagiarized pairs

1-measure = 2 sdot precision

sdot recall

precision + recall (983090983091)

In order to evaluate the proposed method severalbaseline systems are used It is compared with JPlag andCCFinder In all experiments or the parse tree kernel thethreshold or subtree depth Δ is set as 983091 and the decay actortree is 983088983089 Te decay actor graph or graph kernel is set tobe 983088983089 empirically in (983090983089) is set to be 983089 because each sourcecode in our data set is a single program

983095983090 Experimental Results Beore evaluating the perormanceo plagiarism detection we 1047297rst examine relatedness between

the number o source code lines and the cyclomatic complex-ity Tis examination tries to show that (983090983090) is easible Since is determined with cyclomatic complexity it is expected to beproportionalto cyclomatic complexity Figure 983093 shows scatterplot between the number o lines and cyclomatic complexityAs shown in this 1047297gure they are highly correlated with eachother in our data set Te Pearson correlation coefficient is983088983095983089983092 Tis result implies that it is easible to set in (983090983090) to beproportional to cyclomatic complexity

In order to see the effect o threshold in (983089) in ourmethod the perormances are measured according to the

values o Figure 983094 shows the perormance o the pro-posed method or various

As

increases the precision

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1013

983089983088 Mathematical Problems in Engineering

600

500

400

300

200

100

0

C y c

l o m a t i c

c o m p l e x i t y

0 500 1000 1500 2000 2500 3000

Number of lines

F983145983143983157983154983141 983093 Scatter plot according to the number o lines o sourcecodes and the corresponding cyclomatic compelxity

10

09

08

07

06

05

086 088 092090 094 096 098

RecallPrecision

F -measure

Te value of threshold

F983145983143983157983154983141 983094 Perormance o the proposed system or real-world dataset

also increases while the recall decreases slightly Te bestperormance is achieved at = 096 with 983088983096983095 o 1-measureTus

= 096 is used at all the experiments below

Figure 983095 compares the proposed method with variouskernels according to the number o source code lines In this1047297gure the -axis is the number lines o source code andthe -axis represent the average 1-measure As shown inthis 1047297gure the original graph kernel shows the worst per-ormance Since it uses only the graph structure o unctioncalls it ofen ails in calculating the similarity among sourcecodes For example assume that there are two source codesIn one source code main calls a unction add and addcalls another unction multiply In the other source code main calls multiply and multiply calls add Tese twosource codes are the same under the graph kernel since thelabel inormation is ignored by the graph kernel Without

Graph kernel

Parse tree kernel

Modi1047297ed graph kernel

Proposed method

10

08

06

04

02

00

A v e r a g e F 1

- m e a s u r e

lt100 lt200 lt300 lt400 gt400

Number of lines

F983145983143983157983154983141 983095 Average 1-measure according to the number o source

code lines

the labels these two graphs are identical On the other handthe modi1047297ed graph kernel utilizes the label inormation As aresult it achieves better perormance than the graph kernel

Te parse tree kernel achieves higher perormance thanother methods or the source codes with less than 983091983088983088 linesWhen the number o lines in source codes is small theplagiarized codes are ofen made by changing the originalone locally Tus the parse tree kernel detects plagiarizedpairs accurately or the codes with small number o linesWhen source codes have more than 983091983088983088 lines the modi1047297edgraph kernel shows slightly better perormance than theparse tree kernel Tis result implies that high-level structuralinormation is another actor to compare (large) source codesand the modi1047297ed graph kernel can re1047298ect this structuralinormation well

Te proposed method that combines the parse tree kerneland the modi1047297ed graph kernel achieves the best perormanceor all source codes except those with 300 sim 400 linesSince the cyclomatic complexity o source codes with 300 sim400 lines is near 983090983093 the proposed method re1047298ects the parsetree kernel and the modi1047297ed graph kernel equally Tusit achieves an average perormance o the kernels By thecyclomatic complexity o source codes the proposed method

is more in1047298uenced by the parse tree kernel when a sourcecode is small I a source code is large the effect o graphkernel is larger than that o the parse tree kernel From theresults it can be concluded that the proposed method doesnot consider only local-level structural inormation but alsohigh-level structural inormation effectively

Te 1047297nal 1-measure o program source code plagiarismdetection is given in able 983090 Te proposed method showsthe best 1-measure compared to other kernels or opensource plagiarism systems Te difference o 1-measure is983088983090983097 against JPlag 983088983089983095 against CCFinder 983088983088983096 against themodi1047297ed graph and 983088983088983093 against the modi1047297ed parse treekernel Tis result implies that or source code plagiarism

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1113

Mathematical Problems in Engineering 983089983089

983137983138983148983141 983090 Final 1-measure o plagiarism detection

Method 1-measure

JPlag 983088983093983094

CCFinder 983088983095983088

Modi1047297ed parse tree kernel ( = 095) [983093] 983088983096983092

Graph kernel ( = 099) 983088983092983093Modi1047297ed graph kernel ( = 097) 983088983096983090

Proposed method ( = 096) 983088983096983095

detection the similarity measure sim in (983089) should considernot only the syntactic structural inormation but also thedynamic call structure simultaneously

8 Conclusion

In this paper we have proposed a novel method or programsource code comparison Te proposed method calculatesthe

similarity between two source codes with the composition o two kinds o structural inormation extractedrom the sourcecodes Tat is the method uses both syntactic inormationand dynamic inormation Te syntactic inormation whichprovides local-level structural view is included in the parsetree In order to compare the parse trees this paper adoptsa specialized tree kernel or parse trees o source codes Tedynamic inormation which is contained in the unction-callgraph gives high and global level structural view Te graphkernel with the consideration unction names is adopted tore1047298ect the graph structure Finally the proposed methoduses a composite kernel o the kernels to use both kindso inormation In addition the weights o the kernels in

the composite kernel are automatically determined with thecyclomatic complexity

In the experiments o Java program source code plagia-rism detection with real data set it is shown that the proposedmethod outperormed existing methods in detecting plagia-rized pairs In particular the experiments with the variousnumber o lines show that the proposed method alwaysworkswell regardless o the size o source codes

One advantage o the proposed method is that it canbe used with other languages such as C C++ and Pythoneven i the experiments were only conducted with JavaSince the proposed method requires only parse trees andunction-call graphs o source codes it can be applied to any

other languages i a parser or the languages is available Allkinds o inormation o the proposed method are available athttpmlknuackrplagiarism

Conflict of Interests

Te authors declare that there is no con1047298ict o interestsregarding the publication o this paper

Acknowledgments

Tis study was supported by the BK983090983089 Plus project (SWHuman Resource Development Program or Supporting

Smart Lie) unded by the Ministry o Education Schoolo Computer Science and Engineering Kyungpook NationalUniversity Korea (983090983089A983090983088983089983091983089983094983088983088983088983088983093) and by IC RampD pro-gram o MSIPIIP (983089983088983088983092983092983092983097983092 WiseKB Big data based sel-evolving knowledge base and reasoning platorm)

References

[983089] J-W Son and S-B Park ldquoWeb table discrimination with com-position o rich structural and content inormationrdquo Applied Sof Computing vol 983089983091 no 983089 pp 983092983095ndash983093983095 983090983088983089983091

[983090] DL McCabe ldquoCheating among college and university studentsa north American perspectiverdquo International Journal or Educa-tional Integrity vol 983089 no 983089 pp 983089ndash983089983089 983090983088983088983093

[983091] S Horwitz ldquoIdentiying the semantic and textual differencesbetween two versions o a programrdquo in Proceedings o the ACM SIGPLAN Conerence on Programming Language Design and Implementation pp 983090983091983092ndash983090983092983093 983089983097983097983088

[983092] W Yang ldquoIdentiying syntactic differences between two pro-

gramsrdquo Sofware Practice and Experience vol 983090983089 no 983095 pp 983095983091983097ndash983095983093983093 983089983097983097983089

[983093] J-W Son -G Noh H-J Song and S-B Park ldquoAn applicationor plagiarized source code detection based on a parse treekernelrdquo Engineering Applications o Arti1047297cial Intelligencevol983090983094no 983096 pp 983089983097983089983089ndash983089983097983089983096 983090983088983089983091

[983094] M L Kammer Plagiarism detection in haskell programs using call graph matching [MS thesis] Utrecht University 983090983088983089983089

[983095] D Haussler ldquoConvolution kernels on discrete structuresrdquo echRep UCS-CRL-983097983097-983089983088 University o Caliornia Santa CruzCali USA 983089983097983097983097

[983096] B Scholkop K suda and J-P Vert Kernel Methods inComputational Biology MI Press 983090983088983088983092

[983097] M Collins and N Duffy ldquoConvolution kernels or natural lan-guagerdquo in Advances in Neural Inormation Processing Systemspp 983094983090983093ndash983094983091983090 983090983088983088983089

[983089983088] Gartner P Flach and S Wrobel ldquoOn graph kernels hardnessresults and efficient alternativesrdquo in Proceedings o the 983089983094th Annual Conerence on Learning Teory pp 983089983090983097ndash983089983092983091 August983090983088983088983091

[983089983089] D Hindle ldquoNoun classi1047297cation rom predicate-argument struc-turesrdquo in Proceedings o the 983090983096th Annual Meeting on Association or Computational Linguistics (ACL rsquo983097983088) pp 983090983094983096ndash983090983095983093 Strouds-burg Pa USA June 983089983097983097983088

[983089983090] P Resnik ldquoUsing inormation content to evaluate semanticsimilarity in a taxonomyrdquo in Proceedings o the983089983091th International Joint Conerence on Arti1047297cial Intelligence pp 983092983092983096ndash983092983093983091 983089983097983097983093

[983089983091] B Gipp N Meuschke and C Breitinger ldquoCitation-based pla-giarism detection practicability on a large-scale scienti1047297c cor-pusrdquo Journalo the Association or Inormation Science and ech-nology vol 983094983093 no 983096 pp 983089983093983090983095ndash983089983093983092983088 983090983088983089983092

[983089983092] G Varelas E Voutsakis P Rafopoulou E G Petrakis andE E Milios ldquoSemantic similarity methods in wordnet andtheir application to inormation retrieval on the webrdquo in Pro-ceedings o the 983095th Annual ACM International Workshop on WebInormation and Data Management pp 983089983088ndash983089983094 983090983088983088983093

[983089983093] K Williams H-H Chen andC L GilesldquoClassiying and rank-ing search engine results as potential sources o plagiarismrdquo inProceedings o the ACM Symposium on Document Engineering pp 983097983095ndash983089983088983094 Fort Collins Colo USA September 983090983088983089983092

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1213

983089983090 Mathematical Problems in Engineering

[983089983094] R A Jarvis and E A Patrick ldquoClustering using a similarity measure based on shared near neighborsrdquo IEEE ransactions onComputers vol 983090983090 no 983089983089 pp 983089983088983090983093ndash983089983088983091983092 983089983097983095983091

[983089983095] K J Ottenstein ldquoAn algorithmic approach to the detection andprevention o plagiarismrdquo ACM SIGCSE Bulletin vol 983096 no 983092pp 983091983088ndash983092983089 983089983097983095983094

[983089983096] M Halstead Elements o Sofware Science Elsevier 983089983097983095983095[983089983097] I D Baxter A Yahin L Moura M SantrsquoAnna and L Bier

ldquoClone detection using abstract syntax treesrdquo in Proceedingso the IEEE International Conerence on Sofware Maintenance(ICSM rsquo983097983096) pp 983091983094983096ndash983091983095983095 November 983089983097983097983096

[983090983088] J Ferrante K J Ottenstein and J D Warren ldquoTe programdependence graph and its use in optimizationrdquo ACM ransac-tions on Programming Languages and Systems vol 983097 no 983091 pp983091983089983097ndash983091983092983097 983089983097983096983095

[983090983089] C Liu C Chen J Han and P S Yu ldquoGplag detection o sofware plagiarism by program dependence graph analysisrdquo inProceedings o the 983089983090th ACM SIGKDD International Conerenceon Knowledge Discovery and Data Mining pp 983096983095983090ndash983096983096983089 983090983088983088983094

[983090983090] H-I Lim H Park S Choi and Han ldquoA method or detecting

the thef o Java programs through analysis o the control 1047298ow inormationrdquo Inormation and Sofware echnology vol 983093983089 no983097 pp 983089983091983091983096ndash983089983091983093983088 983090983088983088983097

[983090983091] D-K Chae J Ha S-W Kim B J Kang and E G ImldquoSofware plagiarism detection a graph-based approachrdquo inProceedings o the 983090983090nd ACM International Conerence onInormation amp Knowledge Management (CIKM rsquo983089983091) pp 983089983093983095983095ndash983089983093983096983088 Burlingame Cali USA November 983090983088983089983091

[983090983092] E Stamatatos ldquoPlagiarism detection using stopword n-gramsrdquo Journal o the American Society or Inormation Science and echnology vol 983094983090 no 983089983090 pp 983090983093983089983090ndash983090983093983090983095 983090983088983089983089

[983090983093] G Cosma and M Joy ldquoAn approach to source-code plagiarismdetection and investigation using latent semantic analysisrdquo

IEEE ransactions on Computers vol 983094983089 no 983091 pp 983091983095983097ndash983091983097983092983090983088983089983090

[983090983094] L Prechelt G Malpohl and M Philippsen ldquoFinding plagia-risms among a set o programs with jplagrdquo Journal o Universal Computer Science vol 983096 no 983089983089 pp 983089983088983089983094ndash983089983088983091983096 983090983088983088983090

[983090983095] A Aiken ldquoMoss a system or detecting sofware plagiarismrdquo983089983097983097983096 httptheorystanordedusimaikenmoss

[983090983096] Kamiya S Kusumoto and K Inoue ldquoCCFinder a multilin-guistic token-based code clone detection system or large scalesource coderdquo IEEE ransactions on Sofware Engineering vol983090983096 no 983095 pp 983094983093983092ndash983094983095983088 983090983088983088983090

[983090983097] J Parr and R W Quong ldquoANLR a predicated-LL(k) parsergeneratorrdquo Sofware Practice and Experience vol 983090983093 no 983095 pp983095983096983097ndash983096983089983088 983089983097983097983093

[983091983088] V N Vapnik Te Nature o StatisticalLearningTeory SpringerNew York NY USA 983089983097983097983093

[983091983089] R Courant and D Hilbert Methods o Mathematical PhysicsInterscience New York NY USA 983089983097983093983091

[983091983090] A Moschitti and F M Zanzotto ldquoFast and effective kernelsor relational learning rom textsrdquo in Proceedings o the 983090983092thInternational Conerence on Machine Learning (ICML rsquo983088983095) pp983094983092983097ndash983094983093983094 Corvallis Ore USA June 983090983088983088983095

[983091983091] M R Garey and D S Johnson Computers and Intractability AGuide to the Teory o NP-Completeness W H Freeman 983089983097983097983088

[983091983092] S V N Vishwanathan N N Schraudolph R Kondor and KM Borgwardt ldquoGraph kernelsrdquo Journal o Machine Learning Research vol 983089983089 pp 983089983090983088983089ndash983089983090983092983090 983090983088983089983088

[983091983093] K M Borgwardt C S Ong S Schonauer S V N Vish-wanathan A J Smola and H-P Kriegel ldquoProtein unction pre-diction via graph kernelsrdquo Bioinormatics vol 983090983089 supplement 983089pp i983092983095ndashi983093983094 983090983088983088983093

[983091983094] N Cristianini and J Shawe-aylor An Introduction to Support Vector Machines Cambridge University Press CambridgeUK983090983088983088983088

[983091983095] J McCabe ldquoA complexity measurerdquo IEEE ransactions onSofware Engineering vol 983090 no 983092 pp 983091983088983096ndash983091983090983088 983089983097983095983094

[983091983096] J Carletta ldquoAssessing agreement on classi1047297cation tasks thekappa statisticrdquo Computational Linguistics vol 983090983090 no 983090 pp983090983092983097ndash983090983093983092 983089983097983097983094

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1313

Submit your manuscripts at

httpwwwhindawicom

Page 8: journal for scribd

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 813

983096 Mathematical Problems in Engineering

where

step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081= node 983080V 1103925 1103925983081 sdot node 983080V 1103925+1 1103925+1983081

sdot edge

983080983080V 1103925

V 1103925+1

983081 9830801103925

1103925+1

983081983081

(983089983095)

I this modi1047297ed random walk kernel is used or thecomparison o source code the node kernel node and theedge kernel edge should be de1047297ned Note that the labelso edges in unction-call graph are binary values by (983089983089)Tus edge is simply designed to compare binary valuesTe simplest orm or node(V ) is a unction that returns983089 when V and have similar string patterns 983088 otherwiseTat is it returns 983089 i a distance between V and is smallerthan a prede1047297ned threshold In this paper we simply useLevenshtein distance as the distance and set the threshold as983088983093

Te modi1047297ed random walk kernel mg can be alsocomputed using (983089983093) However the adjacency matrix

1038389times o

the direct product 1 times 2 should be modi1047297ed as

9831311038389times983133(V 10383891038389)(V 907317907317)

= 852091step 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081

i 983080983080V 1103925 V 1103925+1983081 9830801103925 1103925+1983081983081 isin times0 otherwise(983089983096)

wheretimes isashortormo times(1times2) and the edges (V 1103925 V 1103925+1)and (1103925 1103925+1) belong to 1 and 2 respectively As in parsetree kernel this modi1047297ed graph kernel is used to comparesource codes as similarity measure sim( ) in (983089)

6 Similarity Measure for Source CodesBased on a Composite Kernel

Te modi1047297ed parse tree kernel manages syntactic structuralinormation whereas the modi1047297ed graph kernel considershigh-level topological inormation o source codes In orderto make use o both kinds o inormation the compositiono the two kernels is required Cristianini and Shawe-aylorproved that a new kernel can be obtained by combiningexisting several kernels with some closure properties suchas weighted sum and multiplication [983091983094] Among variousclosure properties this paper adopts the weighted sum sinceit is simple and widely used

Beore combining two kernels the kernels should benormalized since the modi1047297ed parse tree kernel mpt andthe modi1047297ed graph kernelmg are not bound Tereore onekernel can dominate other in their composition In order toremove this effect the kernels are 1047297rst normalized When akernel() is given itsnormalized one()isde1047297nedas

= radic ( ) sdot 983080 983081 (983089983097)

Tereore () is bounded between 983088 and 983089 Tat is 0 le

(

) le 1

Our composite kernelco is composed o the normalizedmodi1047297ed parse tree kernel

mpt and the normalized modi1047297ed

graph kernelmg Tatis the composite kernel

co or given

two source codes and is de1047297ned as

co

= 9830801 minus 983081 sdot mpt

983080

983081 + sdot mg

983080

983081 (983090983088)

where is a mixing weight between two kernels and areparse trees extracted rom source codes and respectivelyand and are call graphs rom and respectively Telarger is the more signi1047297cant mpt is On the other handas the value o gets small the graph kernel mg is moresigni1047297cant than the parse tree kernel mpt Tis compositekernel is used as our 1047297nal similarity measure sim( ) in (983089)

Te parse tree kernel compares source codes with local-level view since it is based on subtree comparison Most pla-giarized source codes change a small portion o the originalsource code Tus the parse tree kernel has shown goodperormances in general However it does not re1047298ect the 1047298ow o the program which is dynamic structural inormationTegraph kernel on the other hand calculates the similarity interms o dynamic high-level view Tus when source codesconsist o a number o unctions the graph kernel achievesreasonable perormance As a result should be determinedby the complexity o source codes since it is a parameter tocontrol the relative importance between the parse tree kerneland the graph kernel

Tere are many methodsthat measure the complexity o asource code One widely-used method is the cyclomatic com-plexity proposed by McCabe [983091983095] Te cyclomatic complexity is a graph-theoretic quantitative metric and measures thenumber o paths within a source code It is simply calculated

using a control 1047298ow graph o a source code where the nodeso the graph correspond to entities o the source code andan (directed) edge between two nodes implies a dependency relation between entities Given the control 1047298ow graph o asource code () the cyclomatic complexity o source code is de1047297ned as

() = minus + 2 (983090983089)

where is the number o edges o the graph is the numbero nodes and is the number o connected components Telarger is the more complicated source code is

In this paper we measure the complexity o a sourcecode using itsunction-call graph Since a unction-call graphrepresents dependencies among unctions within a programit can be considered as a kind o control 1047298ow graphs whereentities o the source code are the unction in the source codeand the edge implies dependencies between unctions

Let() and() be the cyclomatic complexities o twosource codes and respectively Since is the weight o two normalized kernels it has to normalize between 983088 and 983089Te sigmoid unction is de1047297ned or all real input values andreturns a positive value between 983088 and 983089 Tus the sigmoidunction is adopted or o (983090983088) and is de1047297ned as

= 1

1 + minus(min(()())minus25)

(983090983090)

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 913

Mathematical Problems in Engineering 983097

983137983138983148983141 983089 Simple statistics on the real data set

Inormation Value

Number o total assignments 983091983094

Number o submitted source codes 983093983093983093

Average number o submitted codes per assignment 983089983093983092983090

Minimum number o lines in source code 983092983097Maximum number o lines in source code 983090983096983094983091

Average number o lines per source code 983091983088983093983088983095

Minimum number o nodes in source code 983089983090

Maximum number o nodes in source code 983092983092983095

Average number o nodes in source code 983094983092983090983097

Number o marked plagiarism pairs 983089983095983093

where min() returns the minimum value between and According to (983090983090) as the cyclomatic complexity gets larger also increases is to be 983088983093 when thecyclomatic complexity o source code is 983090983093 Tis indicates

that when the cyclomatic complexity o source code is983090983093 the parse tree kernel and the graph kernel have anequal importance in the composite kernel A number o source code analysis applications regard source codes whosecyclomatic complexity is more than 983090983093 as complicated codes(httpmsdnmicrosofcomen-uslibraryms983089983096983090983090983089983090aspx )Tus we set 983090983093 as the equal point o the importance betweenthe parse tree kernel and the graph kernel

7 Experiments

983095983089 Experimental Settings For experiments the same dataset in the work o Son et al [983093] is used Tis data setis collected rom actual programming assignments o Javaclasses submitted by undergraduate students rom 983090983088983088983093 to983090983088983088983097 able 983089 shows simple statistics o the data set Te totalnumber o programming assignments is 983091983094 and the numbero submitted source codes is 983093983093983093 or the 983091983094 assignments Tusthe average number o source codes per an assignment is983089983093983092983090

Figure 983092 shows the histogram o the source codes perlines Te -axis is the number o program lines and the -axis represents the number o source codes As shown in this1047297gure about 983095983093 o source codes are written with less than983092983088983088 lines Te minimum number o lines o a source code is983092983097 and the maximum is 983090983096983094983091 Te average number o lines

per code is 983091983088983093983088983095In our data set the minimum number o unctions within

a program is 983089983090 whereas the maximum number is 983092983092983095Te programs with larger number o programs are paintprograms with any buttons In the paint programs studentsare required to set a layout manually with raw unctionssuch assetBounds Tus paint programs have a number o unctions Te average number o unctions is 983094983092983090983095

wo annotators created the gold standard or this dataset Tey investigated all sourcecodes and markedplagiarizedpairs manually In order to measurethe reliability andvalidity o the annotators Cohenrsquos kappa agreement [983091983096] is measuredTe kappa agreement o the annotators is

= 093 whichalls

180

160

140

120

100

80

60

40

20

00 500 1000 1500 2000 2500 3000

Source code lines

N u m

b e r o f s

o u r c e c o

d e s

F983145983143983157983154983141 983092 Histogram o source lines per code

on the category o ldquoalmost perect agreementrdquo Only the pairs judged as plagiarized pairs by both annotators are regardedas actual plagiarized pairs In total 983089983095983093 pairs are marked asplagiarized pairs

Tree metrics are used as evaluation measure precisionrecall and 1-measure Tey are calculated as ollows

Precision = o correctly detected plagiarized pairs

o detected plagiarized pairs

Recall = o correctly detected plagiarized pairs

o true plagiarized pairs

1-measure = 2 sdot precision

sdot recall

precision + recall (983090983091)

In order to evaluate the proposed method severalbaseline systems are used It is compared with JPlag andCCFinder In all experiments or the parse tree kernel thethreshold or subtree depth Δ is set as 983091 and the decay actortree is 983088983089 Te decay actor graph or graph kernel is set tobe 983088983089 empirically in (983090983089) is set to be 983089 because each sourcecode in our data set is a single program

983095983090 Experimental Results Beore evaluating the perormanceo plagiarism detection we 1047297rst examine relatedness between

the number o source code lines and the cyclomatic complex-ity Tis examination tries to show that (983090983090) is easible Since is determined with cyclomatic complexity it is expected to beproportionalto cyclomatic complexity Figure 983093 shows scatterplot between the number o lines and cyclomatic complexityAs shown in this 1047297gure they are highly correlated with eachother in our data set Te Pearson correlation coefficient is983088983095983089983092 Tis result implies that it is easible to set in (983090983090) to beproportional to cyclomatic complexity

In order to see the effect o threshold in (983089) in ourmethod the perormances are measured according to the

values o Figure 983094 shows the perormance o the pro-posed method or various

As

increases the precision

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1013

983089983088 Mathematical Problems in Engineering

600

500

400

300

200

100

0

C y c

l o m a t i c

c o m p l e x i t y

0 500 1000 1500 2000 2500 3000

Number of lines

F983145983143983157983154983141 983093 Scatter plot according to the number o lines o sourcecodes and the corresponding cyclomatic compelxity

10

09

08

07

06

05

086 088 092090 094 096 098

RecallPrecision

F -measure

Te value of threshold

F983145983143983157983154983141 983094 Perormance o the proposed system or real-world dataset

also increases while the recall decreases slightly Te bestperormance is achieved at = 096 with 983088983096983095 o 1-measureTus

= 096 is used at all the experiments below

Figure 983095 compares the proposed method with variouskernels according to the number o source code lines In this1047297gure the -axis is the number lines o source code andthe -axis represent the average 1-measure As shown inthis 1047297gure the original graph kernel shows the worst per-ormance Since it uses only the graph structure o unctioncalls it ofen ails in calculating the similarity among sourcecodes For example assume that there are two source codesIn one source code main calls a unction add and addcalls another unction multiply In the other source code main calls multiply and multiply calls add Tese twosource codes are the same under the graph kernel since thelabel inormation is ignored by the graph kernel Without

Graph kernel

Parse tree kernel

Modi1047297ed graph kernel

Proposed method

10

08

06

04

02

00

A v e r a g e F 1

- m e a s u r e

lt100 lt200 lt300 lt400 gt400

Number of lines

F983145983143983157983154983141 983095 Average 1-measure according to the number o source

code lines

the labels these two graphs are identical On the other handthe modi1047297ed graph kernel utilizes the label inormation As aresult it achieves better perormance than the graph kernel

Te parse tree kernel achieves higher perormance thanother methods or the source codes with less than 983091983088983088 linesWhen the number o lines in source codes is small theplagiarized codes are ofen made by changing the originalone locally Tus the parse tree kernel detects plagiarizedpairs accurately or the codes with small number o linesWhen source codes have more than 983091983088983088 lines the modi1047297edgraph kernel shows slightly better perormance than theparse tree kernel Tis result implies that high-level structuralinormation is another actor to compare (large) source codesand the modi1047297ed graph kernel can re1047298ect this structuralinormation well

Te proposed method that combines the parse tree kerneland the modi1047297ed graph kernel achieves the best perormanceor all source codes except those with 300 sim 400 linesSince the cyclomatic complexity o source codes with 300 sim400 lines is near 983090983093 the proposed method re1047298ects the parsetree kernel and the modi1047297ed graph kernel equally Tusit achieves an average perormance o the kernels By thecyclomatic complexity o source codes the proposed method

is more in1047298uenced by the parse tree kernel when a sourcecode is small I a source code is large the effect o graphkernel is larger than that o the parse tree kernel From theresults it can be concluded that the proposed method doesnot consider only local-level structural inormation but alsohigh-level structural inormation effectively

Te 1047297nal 1-measure o program source code plagiarismdetection is given in able 983090 Te proposed method showsthe best 1-measure compared to other kernels or opensource plagiarism systems Te difference o 1-measure is983088983090983097 against JPlag 983088983089983095 against CCFinder 983088983088983096 against themodi1047297ed graph and 983088983088983093 against the modi1047297ed parse treekernel Tis result implies that or source code plagiarism

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1113

Mathematical Problems in Engineering 983089983089

983137983138983148983141 983090 Final 1-measure o plagiarism detection

Method 1-measure

JPlag 983088983093983094

CCFinder 983088983095983088

Modi1047297ed parse tree kernel ( = 095) [983093] 983088983096983092

Graph kernel ( = 099) 983088983092983093Modi1047297ed graph kernel ( = 097) 983088983096983090

Proposed method ( = 096) 983088983096983095

detection the similarity measure sim in (983089) should considernot only the syntactic structural inormation but also thedynamic call structure simultaneously

8 Conclusion

In this paper we have proposed a novel method or programsource code comparison Te proposed method calculatesthe

similarity between two source codes with the composition o two kinds o structural inormation extractedrom the sourcecodes Tat is the method uses both syntactic inormationand dynamic inormation Te syntactic inormation whichprovides local-level structural view is included in the parsetree In order to compare the parse trees this paper adoptsa specialized tree kernel or parse trees o source codes Tedynamic inormation which is contained in the unction-callgraph gives high and global level structural view Te graphkernel with the consideration unction names is adopted tore1047298ect the graph structure Finally the proposed methoduses a composite kernel o the kernels to use both kindso inormation In addition the weights o the kernels in

the composite kernel are automatically determined with thecyclomatic complexity

In the experiments o Java program source code plagia-rism detection with real data set it is shown that the proposedmethod outperormed existing methods in detecting plagia-rized pairs In particular the experiments with the variousnumber o lines show that the proposed method alwaysworkswell regardless o the size o source codes

One advantage o the proposed method is that it canbe used with other languages such as C C++ and Pythoneven i the experiments were only conducted with JavaSince the proposed method requires only parse trees andunction-call graphs o source codes it can be applied to any

other languages i a parser or the languages is available Allkinds o inormation o the proposed method are available athttpmlknuackrplagiarism

Conflict of Interests

Te authors declare that there is no con1047298ict o interestsregarding the publication o this paper

Acknowledgments

Tis study was supported by the BK983090983089 Plus project (SWHuman Resource Development Program or Supporting

Smart Lie) unded by the Ministry o Education Schoolo Computer Science and Engineering Kyungpook NationalUniversity Korea (983090983089A983090983088983089983091983089983094983088983088983088983088983093) and by IC RampD pro-gram o MSIPIIP (983089983088983088983092983092983092983097983092 WiseKB Big data based sel-evolving knowledge base and reasoning platorm)

References

[983089] J-W Son and S-B Park ldquoWeb table discrimination with com-position o rich structural and content inormationrdquo Applied Sof Computing vol 983089983091 no 983089 pp 983092983095ndash983093983095 983090983088983089983091

[983090] DL McCabe ldquoCheating among college and university studentsa north American perspectiverdquo International Journal or Educa-tional Integrity vol 983089 no 983089 pp 983089ndash983089983089 983090983088983088983093

[983091] S Horwitz ldquoIdentiying the semantic and textual differencesbetween two versions o a programrdquo in Proceedings o the ACM SIGPLAN Conerence on Programming Language Design and Implementation pp 983090983091983092ndash983090983092983093 983089983097983097983088

[983092] W Yang ldquoIdentiying syntactic differences between two pro-

gramsrdquo Sofware Practice and Experience vol 983090983089 no 983095 pp 983095983091983097ndash983095983093983093 983089983097983097983089

[983093] J-W Son -G Noh H-J Song and S-B Park ldquoAn applicationor plagiarized source code detection based on a parse treekernelrdquo Engineering Applications o Arti1047297cial Intelligencevol983090983094no 983096 pp 983089983097983089983089ndash983089983097983089983096 983090983088983089983091

[983094] M L Kammer Plagiarism detection in haskell programs using call graph matching [MS thesis] Utrecht University 983090983088983089983089

[983095] D Haussler ldquoConvolution kernels on discrete structuresrdquo echRep UCS-CRL-983097983097-983089983088 University o Caliornia Santa CruzCali USA 983089983097983097983097

[983096] B Scholkop K suda and J-P Vert Kernel Methods inComputational Biology MI Press 983090983088983088983092

[983097] M Collins and N Duffy ldquoConvolution kernels or natural lan-guagerdquo in Advances in Neural Inormation Processing Systemspp 983094983090983093ndash983094983091983090 983090983088983088983089

[983089983088] Gartner P Flach and S Wrobel ldquoOn graph kernels hardnessresults and efficient alternativesrdquo in Proceedings o the 983089983094th Annual Conerence on Learning Teory pp 983089983090983097ndash983089983092983091 August983090983088983088983091

[983089983089] D Hindle ldquoNoun classi1047297cation rom predicate-argument struc-turesrdquo in Proceedings o the 983090983096th Annual Meeting on Association or Computational Linguistics (ACL rsquo983097983088) pp 983090983094983096ndash983090983095983093 Strouds-burg Pa USA June 983089983097983097983088

[983089983090] P Resnik ldquoUsing inormation content to evaluate semanticsimilarity in a taxonomyrdquo in Proceedings o the983089983091th International Joint Conerence on Arti1047297cial Intelligence pp 983092983092983096ndash983092983093983091 983089983097983097983093

[983089983091] B Gipp N Meuschke and C Breitinger ldquoCitation-based pla-giarism detection practicability on a large-scale scienti1047297c cor-pusrdquo Journalo the Association or Inormation Science and ech-nology vol 983094983093 no 983096 pp 983089983093983090983095ndash983089983093983092983088 983090983088983089983092

[983089983092] G Varelas E Voutsakis P Rafopoulou E G Petrakis andE E Milios ldquoSemantic similarity methods in wordnet andtheir application to inormation retrieval on the webrdquo in Pro-ceedings o the 983095th Annual ACM International Workshop on WebInormation and Data Management pp 983089983088ndash983089983094 983090983088983088983093

[983089983093] K Williams H-H Chen andC L GilesldquoClassiying and rank-ing search engine results as potential sources o plagiarismrdquo inProceedings o the ACM Symposium on Document Engineering pp 983097983095ndash983089983088983094 Fort Collins Colo USA September 983090983088983089983092

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1213

983089983090 Mathematical Problems in Engineering

[983089983094] R A Jarvis and E A Patrick ldquoClustering using a similarity measure based on shared near neighborsrdquo IEEE ransactions onComputers vol 983090983090 no 983089983089 pp 983089983088983090983093ndash983089983088983091983092 983089983097983095983091

[983089983095] K J Ottenstein ldquoAn algorithmic approach to the detection andprevention o plagiarismrdquo ACM SIGCSE Bulletin vol 983096 no 983092pp 983091983088ndash983092983089 983089983097983095983094

[983089983096] M Halstead Elements o Sofware Science Elsevier 983089983097983095983095[983089983097] I D Baxter A Yahin L Moura M SantrsquoAnna and L Bier

ldquoClone detection using abstract syntax treesrdquo in Proceedingso the IEEE International Conerence on Sofware Maintenance(ICSM rsquo983097983096) pp 983091983094983096ndash983091983095983095 November 983089983097983097983096

[983090983088] J Ferrante K J Ottenstein and J D Warren ldquoTe programdependence graph and its use in optimizationrdquo ACM ransac-tions on Programming Languages and Systems vol 983097 no 983091 pp983091983089983097ndash983091983092983097 983089983097983096983095

[983090983089] C Liu C Chen J Han and P S Yu ldquoGplag detection o sofware plagiarism by program dependence graph analysisrdquo inProceedings o the 983089983090th ACM SIGKDD International Conerenceon Knowledge Discovery and Data Mining pp 983096983095983090ndash983096983096983089 983090983088983088983094

[983090983090] H-I Lim H Park S Choi and Han ldquoA method or detecting

the thef o Java programs through analysis o the control 1047298ow inormationrdquo Inormation and Sofware echnology vol 983093983089 no983097 pp 983089983091983091983096ndash983089983091983093983088 983090983088983088983097

[983090983091] D-K Chae J Ha S-W Kim B J Kang and E G ImldquoSofware plagiarism detection a graph-based approachrdquo inProceedings o the 983090983090nd ACM International Conerence onInormation amp Knowledge Management (CIKM rsquo983089983091) pp 983089983093983095983095ndash983089983093983096983088 Burlingame Cali USA November 983090983088983089983091

[983090983092] E Stamatatos ldquoPlagiarism detection using stopword n-gramsrdquo Journal o the American Society or Inormation Science and echnology vol 983094983090 no 983089983090 pp 983090983093983089983090ndash983090983093983090983095 983090983088983089983089

[983090983093] G Cosma and M Joy ldquoAn approach to source-code plagiarismdetection and investigation using latent semantic analysisrdquo

IEEE ransactions on Computers vol 983094983089 no 983091 pp 983091983095983097ndash983091983097983092983090983088983089983090

[983090983094] L Prechelt G Malpohl and M Philippsen ldquoFinding plagia-risms among a set o programs with jplagrdquo Journal o Universal Computer Science vol 983096 no 983089983089 pp 983089983088983089983094ndash983089983088983091983096 983090983088983088983090

[983090983095] A Aiken ldquoMoss a system or detecting sofware plagiarismrdquo983089983097983097983096 httptheorystanordedusimaikenmoss

[983090983096] Kamiya S Kusumoto and K Inoue ldquoCCFinder a multilin-guistic token-based code clone detection system or large scalesource coderdquo IEEE ransactions on Sofware Engineering vol983090983096 no 983095 pp 983094983093983092ndash983094983095983088 983090983088983088983090

[983090983097] J Parr and R W Quong ldquoANLR a predicated-LL(k) parsergeneratorrdquo Sofware Practice and Experience vol 983090983093 no 983095 pp983095983096983097ndash983096983089983088 983089983097983097983093

[983091983088] V N Vapnik Te Nature o StatisticalLearningTeory SpringerNew York NY USA 983089983097983097983093

[983091983089] R Courant and D Hilbert Methods o Mathematical PhysicsInterscience New York NY USA 983089983097983093983091

[983091983090] A Moschitti and F M Zanzotto ldquoFast and effective kernelsor relational learning rom textsrdquo in Proceedings o the 983090983092thInternational Conerence on Machine Learning (ICML rsquo983088983095) pp983094983092983097ndash983094983093983094 Corvallis Ore USA June 983090983088983088983095

[983091983091] M R Garey and D S Johnson Computers and Intractability AGuide to the Teory o NP-Completeness W H Freeman 983089983097983097983088

[983091983092] S V N Vishwanathan N N Schraudolph R Kondor and KM Borgwardt ldquoGraph kernelsrdquo Journal o Machine Learning Research vol 983089983089 pp 983089983090983088983089ndash983089983090983092983090 983090983088983089983088

[983091983093] K M Borgwardt C S Ong S Schonauer S V N Vish-wanathan A J Smola and H-P Kriegel ldquoProtein unction pre-diction via graph kernelsrdquo Bioinormatics vol 983090983089 supplement 983089pp i983092983095ndashi983093983094 983090983088983088983093

[983091983094] N Cristianini and J Shawe-aylor An Introduction to Support Vector Machines Cambridge University Press CambridgeUK983090983088983088983088

[983091983095] J McCabe ldquoA complexity measurerdquo IEEE ransactions onSofware Engineering vol 983090 no 983092 pp 983091983088983096ndash983091983090983088 983089983097983095983094

[983091983096] J Carletta ldquoAssessing agreement on classi1047297cation tasks thekappa statisticrdquo Computational Linguistics vol 983090983090 no 983090 pp983090983092983097ndash983090983093983092 983089983097983097983094

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1313

Submit your manuscripts at

httpwwwhindawicom

Page 9: journal for scribd

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 913

Mathematical Problems in Engineering 983097

983137983138983148983141 983089 Simple statistics on the real data set

Inormation Value

Number o total assignments 983091983094

Number o submitted source codes 983093983093983093

Average number o submitted codes per assignment 983089983093983092983090

Minimum number o lines in source code 983092983097Maximum number o lines in source code 983090983096983094983091

Average number o lines per source code 983091983088983093983088983095

Minimum number o nodes in source code 983089983090

Maximum number o nodes in source code 983092983092983095

Average number o nodes in source code 983094983092983090983097

Number o marked plagiarism pairs 983089983095983093

where min() returns the minimum value between and According to (983090983090) as the cyclomatic complexity gets larger also increases is to be 983088983093 when thecyclomatic complexity o source code is 983090983093 Tis indicates

that when the cyclomatic complexity o source code is983090983093 the parse tree kernel and the graph kernel have anequal importance in the composite kernel A number o source code analysis applications regard source codes whosecyclomatic complexity is more than 983090983093 as complicated codes(httpmsdnmicrosofcomen-uslibraryms983089983096983090983090983089983090aspx )Tus we set 983090983093 as the equal point o the importance betweenthe parse tree kernel and the graph kernel

7 Experiments

983095983089 Experimental Settings For experiments the same dataset in the work o Son et al [983093] is used Tis data setis collected rom actual programming assignments o Javaclasses submitted by undergraduate students rom 983090983088983088983093 to983090983088983088983097 able 983089 shows simple statistics o the data set Te totalnumber o programming assignments is 983091983094 and the numbero submitted source codes is 983093983093983093 or the 983091983094 assignments Tusthe average number o source codes per an assignment is983089983093983092983090

Figure 983092 shows the histogram o the source codes perlines Te -axis is the number o program lines and the -axis represents the number o source codes As shown in this1047297gure about 983095983093 o source codes are written with less than983092983088983088 lines Te minimum number o lines o a source code is983092983097 and the maximum is 983090983096983094983091 Te average number o lines

per code is 983091983088983093983088983095In our data set the minimum number o unctions within

a program is 983089983090 whereas the maximum number is 983092983092983095Te programs with larger number o programs are paintprograms with any buttons In the paint programs studentsare required to set a layout manually with raw unctionssuch assetBounds Tus paint programs have a number o unctions Te average number o unctions is 983094983092983090983095

wo annotators created the gold standard or this dataset Tey investigated all sourcecodes and markedplagiarizedpairs manually In order to measurethe reliability andvalidity o the annotators Cohenrsquos kappa agreement [983091983096] is measuredTe kappa agreement o the annotators is

= 093 whichalls

180

160

140

120

100

80

60

40

20

00 500 1000 1500 2000 2500 3000

Source code lines

N u m

b e r o f s

o u r c e c o

d e s

F983145983143983157983154983141 983092 Histogram o source lines per code

on the category o ldquoalmost perect agreementrdquo Only the pairs judged as plagiarized pairs by both annotators are regardedas actual plagiarized pairs In total 983089983095983093 pairs are marked asplagiarized pairs

Tree metrics are used as evaluation measure precisionrecall and 1-measure Tey are calculated as ollows

Precision = o correctly detected plagiarized pairs

o detected plagiarized pairs

Recall = o correctly detected plagiarized pairs

o true plagiarized pairs

1-measure = 2 sdot precision

sdot recall

precision + recall (983090983091)

In order to evaluate the proposed method severalbaseline systems are used It is compared with JPlag andCCFinder In all experiments or the parse tree kernel thethreshold or subtree depth Δ is set as 983091 and the decay actortree is 983088983089 Te decay actor graph or graph kernel is set tobe 983088983089 empirically in (983090983089) is set to be 983089 because each sourcecode in our data set is a single program

983095983090 Experimental Results Beore evaluating the perormanceo plagiarism detection we 1047297rst examine relatedness between

the number o source code lines and the cyclomatic complex-ity Tis examination tries to show that (983090983090) is easible Since is determined with cyclomatic complexity it is expected to beproportionalto cyclomatic complexity Figure 983093 shows scatterplot between the number o lines and cyclomatic complexityAs shown in this 1047297gure they are highly correlated with eachother in our data set Te Pearson correlation coefficient is983088983095983089983092 Tis result implies that it is easible to set in (983090983090) to beproportional to cyclomatic complexity

In order to see the effect o threshold in (983089) in ourmethod the perormances are measured according to the

values o Figure 983094 shows the perormance o the pro-posed method or various

As

increases the precision

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1013

983089983088 Mathematical Problems in Engineering

600

500

400

300

200

100

0

C y c

l o m a t i c

c o m p l e x i t y

0 500 1000 1500 2000 2500 3000

Number of lines

F983145983143983157983154983141 983093 Scatter plot according to the number o lines o sourcecodes and the corresponding cyclomatic compelxity

10

09

08

07

06

05

086 088 092090 094 096 098

RecallPrecision

F -measure

Te value of threshold

F983145983143983157983154983141 983094 Perormance o the proposed system or real-world dataset

also increases while the recall decreases slightly Te bestperormance is achieved at = 096 with 983088983096983095 o 1-measureTus

= 096 is used at all the experiments below

Figure 983095 compares the proposed method with variouskernels according to the number o source code lines In this1047297gure the -axis is the number lines o source code andthe -axis represent the average 1-measure As shown inthis 1047297gure the original graph kernel shows the worst per-ormance Since it uses only the graph structure o unctioncalls it ofen ails in calculating the similarity among sourcecodes For example assume that there are two source codesIn one source code main calls a unction add and addcalls another unction multiply In the other source code main calls multiply and multiply calls add Tese twosource codes are the same under the graph kernel since thelabel inormation is ignored by the graph kernel Without

Graph kernel

Parse tree kernel

Modi1047297ed graph kernel

Proposed method

10

08

06

04

02

00

A v e r a g e F 1

- m e a s u r e

lt100 lt200 lt300 lt400 gt400

Number of lines

F983145983143983157983154983141 983095 Average 1-measure according to the number o source

code lines

the labels these two graphs are identical On the other handthe modi1047297ed graph kernel utilizes the label inormation As aresult it achieves better perormance than the graph kernel

Te parse tree kernel achieves higher perormance thanother methods or the source codes with less than 983091983088983088 linesWhen the number o lines in source codes is small theplagiarized codes are ofen made by changing the originalone locally Tus the parse tree kernel detects plagiarizedpairs accurately or the codes with small number o linesWhen source codes have more than 983091983088983088 lines the modi1047297edgraph kernel shows slightly better perormance than theparse tree kernel Tis result implies that high-level structuralinormation is another actor to compare (large) source codesand the modi1047297ed graph kernel can re1047298ect this structuralinormation well

Te proposed method that combines the parse tree kerneland the modi1047297ed graph kernel achieves the best perormanceor all source codes except those with 300 sim 400 linesSince the cyclomatic complexity o source codes with 300 sim400 lines is near 983090983093 the proposed method re1047298ects the parsetree kernel and the modi1047297ed graph kernel equally Tusit achieves an average perormance o the kernels By thecyclomatic complexity o source codes the proposed method

is more in1047298uenced by the parse tree kernel when a sourcecode is small I a source code is large the effect o graphkernel is larger than that o the parse tree kernel From theresults it can be concluded that the proposed method doesnot consider only local-level structural inormation but alsohigh-level structural inormation effectively

Te 1047297nal 1-measure o program source code plagiarismdetection is given in able 983090 Te proposed method showsthe best 1-measure compared to other kernels or opensource plagiarism systems Te difference o 1-measure is983088983090983097 against JPlag 983088983089983095 against CCFinder 983088983088983096 against themodi1047297ed graph and 983088983088983093 against the modi1047297ed parse treekernel Tis result implies that or source code plagiarism

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1113

Mathematical Problems in Engineering 983089983089

983137983138983148983141 983090 Final 1-measure o plagiarism detection

Method 1-measure

JPlag 983088983093983094

CCFinder 983088983095983088

Modi1047297ed parse tree kernel ( = 095) [983093] 983088983096983092

Graph kernel ( = 099) 983088983092983093Modi1047297ed graph kernel ( = 097) 983088983096983090

Proposed method ( = 096) 983088983096983095

detection the similarity measure sim in (983089) should considernot only the syntactic structural inormation but also thedynamic call structure simultaneously

8 Conclusion

In this paper we have proposed a novel method or programsource code comparison Te proposed method calculatesthe

similarity between two source codes with the composition o two kinds o structural inormation extractedrom the sourcecodes Tat is the method uses both syntactic inormationand dynamic inormation Te syntactic inormation whichprovides local-level structural view is included in the parsetree In order to compare the parse trees this paper adoptsa specialized tree kernel or parse trees o source codes Tedynamic inormation which is contained in the unction-callgraph gives high and global level structural view Te graphkernel with the consideration unction names is adopted tore1047298ect the graph structure Finally the proposed methoduses a composite kernel o the kernels to use both kindso inormation In addition the weights o the kernels in

the composite kernel are automatically determined with thecyclomatic complexity

In the experiments o Java program source code plagia-rism detection with real data set it is shown that the proposedmethod outperormed existing methods in detecting plagia-rized pairs In particular the experiments with the variousnumber o lines show that the proposed method alwaysworkswell regardless o the size o source codes

One advantage o the proposed method is that it canbe used with other languages such as C C++ and Pythoneven i the experiments were only conducted with JavaSince the proposed method requires only parse trees andunction-call graphs o source codes it can be applied to any

other languages i a parser or the languages is available Allkinds o inormation o the proposed method are available athttpmlknuackrplagiarism

Conflict of Interests

Te authors declare that there is no con1047298ict o interestsregarding the publication o this paper

Acknowledgments

Tis study was supported by the BK983090983089 Plus project (SWHuman Resource Development Program or Supporting

Smart Lie) unded by the Ministry o Education Schoolo Computer Science and Engineering Kyungpook NationalUniversity Korea (983090983089A983090983088983089983091983089983094983088983088983088983088983093) and by IC RampD pro-gram o MSIPIIP (983089983088983088983092983092983092983097983092 WiseKB Big data based sel-evolving knowledge base and reasoning platorm)

References

[983089] J-W Son and S-B Park ldquoWeb table discrimination with com-position o rich structural and content inormationrdquo Applied Sof Computing vol 983089983091 no 983089 pp 983092983095ndash983093983095 983090983088983089983091

[983090] DL McCabe ldquoCheating among college and university studentsa north American perspectiverdquo International Journal or Educa-tional Integrity vol 983089 no 983089 pp 983089ndash983089983089 983090983088983088983093

[983091] S Horwitz ldquoIdentiying the semantic and textual differencesbetween two versions o a programrdquo in Proceedings o the ACM SIGPLAN Conerence on Programming Language Design and Implementation pp 983090983091983092ndash983090983092983093 983089983097983097983088

[983092] W Yang ldquoIdentiying syntactic differences between two pro-

gramsrdquo Sofware Practice and Experience vol 983090983089 no 983095 pp 983095983091983097ndash983095983093983093 983089983097983097983089

[983093] J-W Son -G Noh H-J Song and S-B Park ldquoAn applicationor plagiarized source code detection based on a parse treekernelrdquo Engineering Applications o Arti1047297cial Intelligencevol983090983094no 983096 pp 983089983097983089983089ndash983089983097983089983096 983090983088983089983091

[983094] M L Kammer Plagiarism detection in haskell programs using call graph matching [MS thesis] Utrecht University 983090983088983089983089

[983095] D Haussler ldquoConvolution kernels on discrete structuresrdquo echRep UCS-CRL-983097983097-983089983088 University o Caliornia Santa CruzCali USA 983089983097983097983097

[983096] B Scholkop K suda and J-P Vert Kernel Methods inComputational Biology MI Press 983090983088983088983092

[983097] M Collins and N Duffy ldquoConvolution kernels or natural lan-guagerdquo in Advances in Neural Inormation Processing Systemspp 983094983090983093ndash983094983091983090 983090983088983088983089

[983089983088] Gartner P Flach and S Wrobel ldquoOn graph kernels hardnessresults and efficient alternativesrdquo in Proceedings o the 983089983094th Annual Conerence on Learning Teory pp 983089983090983097ndash983089983092983091 August983090983088983088983091

[983089983089] D Hindle ldquoNoun classi1047297cation rom predicate-argument struc-turesrdquo in Proceedings o the 983090983096th Annual Meeting on Association or Computational Linguistics (ACL rsquo983097983088) pp 983090983094983096ndash983090983095983093 Strouds-burg Pa USA June 983089983097983097983088

[983089983090] P Resnik ldquoUsing inormation content to evaluate semanticsimilarity in a taxonomyrdquo in Proceedings o the983089983091th International Joint Conerence on Arti1047297cial Intelligence pp 983092983092983096ndash983092983093983091 983089983097983097983093

[983089983091] B Gipp N Meuschke and C Breitinger ldquoCitation-based pla-giarism detection practicability on a large-scale scienti1047297c cor-pusrdquo Journalo the Association or Inormation Science and ech-nology vol 983094983093 no 983096 pp 983089983093983090983095ndash983089983093983092983088 983090983088983089983092

[983089983092] G Varelas E Voutsakis P Rafopoulou E G Petrakis andE E Milios ldquoSemantic similarity methods in wordnet andtheir application to inormation retrieval on the webrdquo in Pro-ceedings o the 983095th Annual ACM International Workshop on WebInormation and Data Management pp 983089983088ndash983089983094 983090983088983088983093

[983089983093] K Williams H-H Chen andC L GilesldquoClassiying and rank-ing search engine results as potential sources o plagiarismrdquo inProceedings o the ACM Symposium on Document Engineering pp 983097983095ndash983089983088983094 Fort Collins Colo USA September 983090983088983089983092

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1213

983089983090 Mathematical Problems in Engineering

[983089983094] R A Jarvis and E A Patrick ldquoClustering using a similarity measure based on shared near neighborsrdquo IEEE ransactions onComputers vol 983090983090 no 983089983089 pp 983089983088983090983093ndash983089983088983091983092 983089983097983095983091

[983089983095] K J Ottenstein ldquoAn algorithmic approach to the detection andprevention o plagiarismrdquo ACM SIGCSE Bulletin vol 983096 no 983092pp 983091983088ndash983092983089 983089983097983095983094

[983089983096] M Halstead Elements o Sofware Science Elsevier 983089983097983095983095[983089983097] I D Baxter A Yahin L Moura M SantrsquoAnna and L Bier

ldquoClone detection using abstract syntax treesrdquo in Proceedingso the IEEE International Conerence on Sofware Maintenance(ICSM rsquo983097983096) pp 983091983094983096ndash983091983095983095 November 983089983097983097983096

[983090983088] J Ferrante K J Ottenstein and J D Warren ldquoTe programdependence graph and its use in optimizationrdquo ACM ransac-tions on Programming Languages and Systems vol 983097 no 983091 pp983091983089983097ndash983091983092983097 983089983097983096983095

[983090983089] C Liu C Chen J Han and P S Yu ldquoGplag detection o sofware plagiarism by program dependence graph analysisrdquo inProceedings o the 983089983090th ACM SIGKDD International Conerenceon Knowledge Discovery and Data Mining pp 983096983095983090ndash983096983096983089 983090983088983088983094

[983090983090] H-I Lim H Park S Choi and Han ldquoA method or detecting

the thef o Java programs through analysis o the control 1047298ow inormationrdquo Inormation and Sofware echnology vol 983093983089 no983097 pp 983089983091983091983096ndash983089983091983093983088 983090983088983088983097

[983090983091] D-K Chae J Ha S-W Kim B J Kang and E G ImldquoSofware plagiarism detection a graph-based approachrdquo inProceedings o the 983090983090nd ACM International Conerence onInormation amp Knowledge Management (CIKM rsquo983089983091) pp 983089983093983095983095ndash983089983093983096983088 Burlingame Cali USA November 983090983088983089983091

[983090983092] E Stamatatos ldquoPlagiarism detection using stopword n-gramsrdquo Journal o the American Society or Inormation Science and echnology vol 983094983090 no 983089983090 pp 983090983093983089983090ndash983090983093983090983095 983090983088983089983089

[983090983093] G Cosma and M Joy ldquoAn approach to source-code plagiarismdetection and investigation using latent semantic analysisrdquo

IEEE ransactions on Computers vol 983094983089 no 983091 pp 983091983095983097ndash983091983097983092983090983088983089983090

[983090983094] L Prechelt G Malpohl and M Philippsen ldquoFinding plagia-risms among a set o programs with jplagrdquo Journal o Universal Computer Science vol 983096 no 983089983089 pp 983089983088983089983094ndash983089983088983091983096 983090983088983088983090

[983090983095] A Aiken ldquoMoss a system or detecting sofware plagiarismrdquo983089983097983097983096 httptheorystanordedusimaikenmoss

[983090983096] Kamiya S Kusumoto and K Inoue ldquoCCFinder a multilin-guistic token-based code clone detection system or large scalesource coderdquo IEEE ransactions on Sofware Engineering vol983090983096 no 983095 pp 983094983093983092ndash983094983095983088 983090983088983088983090

[983090983097] J Parr and R W Quong ldquoANLR a predicated-LL(k) parsergeneratorrdquo Sofware Practice and Experience vol 983090983093 no 983095 pp983095983096983097ndash983096983089983088 983089983097983097983093

[983091983088] V N Vapnik Te Nature o StatisticalLearningTeory SpringerNew York NY USA 983089983097983097983093

[983091983089] R Courant and D Hilbert Methods o Mathematical PhysicsInterscience New York NY USA 983089983097983093983091

[983091983090] A Moschitti and F M Zanzotto ldquoFast and effective kernelsor relational learning rom textsrdquo in Proceedings o the 983090983092thInternational Conerence on Machine Learning (ICML rsquo983088983095) pp983094983092983097ndash983094983093983094 Corvallis Ore USA June 983090983088983088983095

[983091983091] M R Garey and D S Johnson Computers and Intractability AGuide to the Teory o NP-Completeness W H Freeman 983089983097983097983088

[983091983092] S V N Vishwanathan N N Schraudolph R Kondor and KM Borgwardt ldquoGraph kernelsrdquo Journal o Machine Learning Research vol 983089983089 pp 983089983090983088983089ndash983089983090983092983090 983090983088983089983088

[983091983093] K M Borgwardt C S Ong S Schonauer S V N Vish-wanathan A J Smola and H-P Kriegel ldquoProtein unction pre-diction via graph kernelsrdquo Bioinormatics vol 983090983089 supplement 983089pp i983092983095ndashi983093983094 983090983088983088983093

[983091983094] N Cristianini and J Shawe-aylor An Introduction to Support Vector Machines Cambridge University Press CambridgeUK983090983088983088983088

[983091983095] J McCabe ldquoA complexity measurerdquo IEEE ransactions onSofware Engineering vol 983090 no 983092 pp 983091983088983096ndash983091983090983088 983089983097983095983094

[983091983096] J Carletta ldquoAssessing agreement on classi1047297cation tasks thekappa statisticrdquo Computational Linguistics vol 983090983090 no 983090 pp983090983092983097ndash983090983093983092 983089983097983097983094

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1313

Submit your manuscripts at

httpwwwhindawicom

Page 10: journal for scribd

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1013

983089983088 Mathematical Problems in Engineering

600

500

400

300

200

100

0

C y c

l o m a t i c

c o m p l e x i t y

0 500 1000 1500 2000 2500 3000

Number of lines

F983145983143983157983154983141 983093 Scatter plot according to the number o lines o sourcecodes and the corresponding cyclomatic compelxity

10

09

08

07

06

05

086 088 092090 094 096 098

RecallPrecision

F -measure

Te value of threshold

F983145983143983157983154983141 983094 Perormance o the proposed system or real-world dataset

also increases while the recall decreases slightly Te bestperormance is achieved at = 096 with 983088983096983095 o 1-measureTus

= 096 is used at all the experiments below

Figure 983095 compares the proposed method with variouskernels according to the number o source code lines In this1047297gure the -axis is the number lines o source code andthe -axis represent the average 1-measure As shown inthis 1047297gure the original graph kernel shows the worst per-ormance Since it uses only the graph structure o unctioncalls it ofen ails in calculating the similarity among sourcecodes For example assume that there are two source codesIn one source code main calls a unction add and addcalls another unction multiply In the other source code main calls multiply and multiply calls add Tese twosource codes are the same under the graph kernel since thelabel inormation is ignored by the graph kernel Without

Graph kernel

Parse tree kernel

Modi1047297ed graph kernel

Proposed method

10

08

06

04

02

00

A v e r a g e F 1

- m e a s u r e

lt100 lt200 lt300 lt400 gt400

Number of lines

F983145983143983157983154983141 983095 Average 1-measure according to the number o source

code lines

the labels these two graphs are identical On the other handthe modi1047297ed graph kernel utilizes the label inormation As aresult it achieves better perormance than the graph kernel

Te parse tree kernel achieves higher perormance thanother methods or the source codes with less than 983091983088983088 linesWhen the number o lines in source codes is small theplagiarized codes are ofen made by changing the originalone locally Tus the parse tree kernel detects plagiarizedpairs accurately or the codes with small number o linesWhen source codes have more than 983091983088983088 lines the modi1047297edgraph kernel shows slightly better perormance than theparse tree kernel Tis result implies that high-level structuralinormation is another actor to compare (large) source codesand the modi1047297ed graph kernel can re1047298ect this structuralinormation well

Te proposed method that combines the parse tree kerneland the modi1047297ed graph kernel achieves the best perormanceor all source codes except those with 300 sim 400 linesSince the cyclomatic complexity o source codes with 300 sim400 lines is near 983090983093 the proposed method re1047298ects the parsetree kernel and the modi1047297ed graph kernel equally Tusit achieves an average perormance o the kernels By thecyclomatic complexity o source codes the proposed method

is more in1047298uenced by the parse tree kernel when a sourcecode is small I a source code is large the effect o graphkernel is larger than that o the parse tree kernel From theresults it can be concluded that the proposed method doesnot consider only local-level structural inormation but alsohigh-level structural inormation effectively

Te 1047297nal 1-measure o program source code plagiarismdetection is given in able 983090 Te proposed method showsthe best 1-measure compared to other kernels or opensource plagiarism systems Te difference o 1-measure is983088983090983097 against JPlag 983088983089983095 against CCFinder 983088983088983096 against themodi1047297ed graph and 983088983088983093 against the modi1047297ed parse treekernel Tis result implies that or source code plagiarism

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1113

Mathematical Problems in Engineering 983089983089

983137983138983148983141 983090 Final 1-measure o plagiarism detection

Method 1-measure

JPlag 983088983093983094

CCFinder 983088983095983088

Modi1047297ed parse tree kernel ( = 095) [983093] 983088983096983092

Graph kernel ( = 099) 983088983092983093Modi1047297ed graph kernel ( = 097) 983088983096983090

Proposed method ( = 096) 983088983096983095

detection the similarity measure sim in (983089) should considernot only the syntactic structural inormation but also thedynamic call structure simultaneously

8 Conclusion

In this paper we have proposed a novel method or programsource code comparison Te proposed method calculatesthe

similarity between two source codes with the composition o two kinds o structural inormation extractedrom the sourcecodes Tat is the method uses both syntactic inormationand dynamic inormation Te syntactic inormation whichprovides local-level structural view is included in the parsetree In order to compare the parse trees this paper adoptsa specialized tree kernel or parse trees o source codes Tedynamic inormation which is contained in the unction-callgraph gives high and global level structural view Te graphkernel with the consideration unction names is adopted tore1047298ect the graph structure Finally the proposed methoduses a composite kernel o the kernels to use both kindso inormation In addition the weights o the kernels in

the composite kernel are automatically determined with thecyclomatic complexity

In the experiments o Java program source code plagia-rism detection with real data set it is shown that the proposedmethod outperormed existing methods in detecting plagia-rized pairs In particular the experiments with the variousnumber o lines show that the proposed method alwaysworkswell regardless o the size o source codes

One advantage o the proposed method is that it canbe used with other languages such as C C++ and Pythoneven i the experiments were only conducted with JavaSince the proposed method requires only parse trees andunction-call graphs o source codes it can be applied to any

other languages i a parser or the languages is available Allkinds o inormation o the proposed method are available athttpmlknuackrplagiarism

Conflict of Interests

Te authors declare that there is no con1047298ict o interestsregarding the publication o this paper

Acknowledgments

Tis study was supported by the BK983090983089 Plus project (SWHuman Resource Development Program or Supporting

Smart Lie) unded by the Ministry o Education Schoolo Computer Science and Engineering Kyungpook NationalUniversity Korea (983090983089A983090983088983089983091983089983094983088983088983088983088983093) and by IC RampD pro-gram o MSIPIIP (983089983088983088983092983092983092983097983092 WiseKB Big data based sel-evolving knowledge base and reasoning platorm)

References

[983089] J-W Son and S-B Park ldquoWeb table discrimination with com-position o rich structural and content inormationrdquo Applied Sof Computing vol 983089983091 no 983089 pp 983092983095ndash983093983095 983090983088983089983091

[983090] DL McCabe ldquoCheating among college and university studentsa north American perspectiverdquo International Journal or Educa-tional Integrity vol 983089 no 983089 pp 983089ndash983089983089 983090983088983088983093

[983091] S Horwitz ldquoIdentiying the semantic and textual differencesbetween two versions o a programrdquo in Proceedings o the ACM SIGPLAN Conerence on Programming Language Design and Implementation pp 983090983091983092ndash983090983092983093 983089983097983097983088

[983092] W Yang ldquoIdentiying syntactic differences between two pro-

gramsrdquo Sofware Practice and Experience vol 983090983089 no 983095 pp 983095983091983097ndash983095983093983093 983089983097983097983089

[983093] J-W Son -G Noh H-J Song and S-B Park ldquoAn applicationor plagiarized source code detection based on a parse treekernelrdquo Engineering Applications o Arti1047297cial Intelligencevol983090983094no 983096 pp 983089983097983089983089ndash983089983097983089983096 983090983088983089983091

[983094] M L Kammer Plagiarism detection in haskell programs using call graph matching [MS thesis] Utrecht University 983090983088983089983089

[983095] D Haussler ldquoConvolution kernels on discrete structuresrdquo echRep UCS-CRL-983097983097-983089983088 University o Caliornia Santa CruzCali USA 983089983097983097983097

[983096] B Scholkop K suda and J-P Vert Kernel Methods inComputational Biology MI Press 983090983088983088983092

[983097] M Collins and N Duffy ldquoConvolution kernels or natural lan-guagerdquo in Advances in Neural Inormation Processing Systemspp 983094983090983093ndash983094983091983090 983090983088983088983089

[983089983088] Gartner P Flach and S Wrobel ldquoOn graph kernels hardnessresults and efficient alternativesrdquo in Proceedings o the 983089983094th Annual Conerence on Learning Teory pp 983089983090983097ndash983089983092983091 August983090983088983088983091

[983089983089] D Hindle ldquoNoun classi1047297cation rom predicate-argument struc-turesrdquo in Proceedings o the 983090983096th Annual Meeting on Association or Computational Linguistics (ACL rsquo983097983088) pp 983090983094983096ndash983090983095983093 Strouds-burg Pa USA June 983089983097983097983088

[983089983090] P Resnik ldquoUsing inormation content to evaluate semanticsimilarity in a taxonomyrdquo in Proceedings o the983089983091th International Joint Conerence on Arti1047297cial Intelligence pp 983092983092983096ndash983092983093983091 983089983097983097983093

[983089983091] B Gipp N Meuschke and C Breitinger ldquoCitation-based pla-giarism detection practicability on a large-scale scienti1047297c cor-pusrdquo Journalo the Association or Inormation Science and ech-nology vol 983094983093 no 983096 pp 983089983093983090983095ndash983089983093983092983088 983090983088983089983092

[983089983092] G Varelas E Voutsakis P Rafopoulou E G Petrakis andE E Milios ldquoSemantic similarity methods in wordnet andtheir application to inormation retrieval on the webrdquo in Pro-ceedings o the 983095th Annual ACM International Workshop on WebInormation and Data Management pp 983089983088ndash983089983094 983090983088983088983093

[983089983093] K Williams H-H Chen andC L GilesldquoClassiying and rank-ing search engine results as potential sources o plagiarismrdquo inProceedings o the ACM Symposium on Document Engineering pp 983097983095ndash983089983088983094 Fort Collins Colo USA September 983090983088983089983092

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1213

983089983090 Mathematical Problems in Engineering

[983089983094] R A Jarvis and E A Patrick ldquoClustering using a similarity measure based on shared near neighborsrdquo IEEE ransactions onComputers vol 983090983090 no 983089983089 pp 983089983088983090983093ndash983089983088983091983092 983089983097983095983091

[983089983095] K J Ottenstein ldquoAn algorithmic approach to the detection andprevention o plagiarismrdquo ACM SIGCSE Bulletin vol 983096 no 983092pp 983091983088ndash983092983089 983089983097983095983094

[983089983096] M Halstead Elements o Sofware Science Elsevier 983089983097983095983095[983089983097] I D Baxter A Yahin L Moura M SantrsquoAnna and L Bier

ldquoClone detection using abstract syntax treesrdquo in Proceedingso the IEEE International Conerence on Sofware Maintenance(ICSM rsquo983097983096) pp 983091983094983096ndash983091983095983095 November 983089983097983097983096

[983090983088] J Ferrante K J Ottenstein and J D Warren ldquoTe programdependence graph and its use in optimizationrdquo ACM ransac-tions on Programming Languages and Systems vol 983097 no 983091 pp983091983089983097ndash983091983092983097 983089983097983096983095

[983090983089] C Liu C Chen J Han and P S Yu ldquoGplag detection o sofware plagiarism by program dependence graph analysisrdquo inProceedings o the 983089983090th ACM SIGKDD International Conerenceon Knowledge Discovery and Data Mining pp 983096983095983090ndash983096983096983089 983090983088983088983094

[983090983090] H-I Lim H Park S Choi and Han ldquoA method or detecting

the thef o Java programs through analysis o the control 1047298ow inormationrdquo Inormation and Sofware echnology vol 983093983089 no983097 pp 983089983091983091983096ndash983089983091983093983088 983090983088983088983097

[983090983091] D-K Chae J Ha S-W Kim B J Kang and E G ImldquoSofware plagiarism detection a graph-based approachrdquo inProceedings o the 983090983090nd ACM International Conerence onInormation amp Knowledge Management (CIKM rsquo983089983091) pp 983089983093983095983095ndash983089983093983096983088 Burlingame Cali USA November 983090983088983089983091

[983090983092] E Stamatatos ldquoPlagiarism detection using stopword n-gramsrdquo Journal o the American Society or Inormation Science and echnology vol 983094983090 no 983089983090 pp 983090983093983089983090ndash983090983093983090983095 983090983088983089983089

[983090983093] G Cosma and M Joy ldquoAn approach to source-code plagiarismdetection and investigation using latent semantic analysisrdquo

IEEE ransactions on Computers vol 983094983089 no 983091 pp 983091983095983097ndash983091983097983092983090983088983089983090

[983090983094] L Prechelt G Malpohl and M Philippsen ldquoFinding plagia-risms among a set o programs with jplagrdquo Journal o Universal Computer Science vol 983096 no 983089983089 pp 983089983088983089983094ndash983089983088983091983096 983090983088983088983090

[983090983095] A Aiken ldquoMoss a system or detecting sofware plagiarismrdquo983089983097983097983096 httptheorystanordedusimaikenmoss

[983090983096] Kamiya S Kusumoto and K Inoue ldquoCCFinder a multilin-guistic token-based code clone detection system or large scalesource coderdquo IEEE ransactions on Sofware Engineering vol983090983096 no 983095 pp 983094983093983092ndash983094983095983088 983090983088983088983090

[983090983097] J Parr and R W Quong ldquoANLR a predicated-LL(k) parsergeneratorrdquo Sofware Practice and Experience vol 983090983093 no 983095 pp983095983096983097ndash983096983089983088 983089983097983097983093

[983091983088] V N Vapnik Te Nature o StatisticalLearningTeory SpringerNew York NY USA 983089983097983097983093

[983091983089] R Courant and D Hilbert Methods o Mathematical PhysicsInterscience New York NY USA 983089983097983093983091

[983091983090] A Moschitti and F M Zanzotto ldquoFast and effective kernelsor relational learning rom textsrdquo in Proceedings o the 983090983092thInternational Conerence on Machine Learning (ICML rsquo983088983095) pp983094983092983097ndash983094983093983094 Corvallis Ore USA June 983090983088983088983095

[983091983091] M R Garey and D S Johnson Computers and Intractability AGuide to the Teory o NP-Completeness W H Freeman 983089983097983097983088

[983091983092] S V N Vishwanathan N N Schraudolph R Kondor and KM Borgwardt ldquoGraph kernelsrdquo Journal o Machine Learning Research vol 983089983089 pp 983089983090983088983089ndash983089983090983092983090 983090983088983089983088

[983091983093] K M Borgwardt C S Ong S Schonauer S V N Vish-wanathan A J Smola and H-P Kriegel ldquoProtein unction pre-diction via graph kernelsrdquo Bioinormatics vol 983090983089 supplement 983089pp i983092983095ndashi983093983094 983090983088983088983093

[983091983094] N Cristianini and J Shawe-aylor An Introduction to Support Vector Machines Cambridge University Press CambridgeUK983090983088983088983088

[983091983095] J McCabe ldquoA complexity measurerdquo IEEE ransactions onSofware Engineering vol 983090 no 983092 pp 983091983088983096ndash983091983090983088 983089983097983095983094

[983091983096] J Carletta ldquoAssessing agreement on classi1047297cation tasks thekappa statisticrdquo Computational Linguistics vol 983090983090 no 983090 pp983090983092983097ndash983090983093983092 983089983097983097983094

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1313

Submit your manuscripts at

httpwwwhindawicom

Page 11: journal for scribd

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1113

Mathematical Problems in Engineering 983089983089

983137983138983148983141 983090 Final 1-measure o plagiarism detection

Method 1-measure

JPlag 983088983093983094

CCFinder 983088983095983088

Modi1047297ed parse tree kernel ( = 095) [983093] 983088983096983092

Graph kernel ( = 099) 983088983092983093Modi1047297ed graph kernel ( = 097) 983088983096983090

Proposed method ( = 096) 983088983096983095

detection the similarity measure sim in (983089) should considernot only the syntactic structural inormation but also thedynamic call structure simultaneously

8 Conclusion

In this paper we have proposed a novel method or programsource code comparison Te proposed method calculatesthe

similarity between two source codes with the composition o two kinds o structural inormation extractedrom the sourcecodes Tat is the method uses both syntactic inormationand dynamic inormation Te syntactic inormation whichprovides local-level structural view is included in the parsetree In order to compare the parse trees this paper adoptsa specialized tree kernel or parse trees o source codes Tedynamic inormation which is contained in the unction-callgraph gives high and global level structural view Te graphkernel with the consideration unction names is adopted tore1047298ect the graph structure Finally the proposed methoduses a composite kernel o the kernels to use both kindso inormation In addition the weights o the kernels in

the composite kernel are automatically determined with thecyclomatic complexity

In the experiments o Java program source code plagia-rism detection with real data set it is shown that the proposedmethod outperormed existing methods in detecting plagia-rized pairs In particular the experiments with the variousnumber o lines show that the proposed method alwaysworkswell regardless o the size o source codes

One advantage o the proposed method is that it canbe used with other languages such as C C++ and Pythoneven i the experiments were only conducted with JavaSince the proposed method requires only parse trees andunction-call graphs o source codes it can be applied to any

other languages i a parser or the languages is available Allkinds o inormation o the proposed method are available athttpmlknuackrplagiarism

Conflict of Interests

Te authors declare that there is no con1047298ict o interestsregarding the publication o this paper

Acknowledgments

Tis study was supported by the BK983090983089 Plus project (SWHuman Resource Development Program or Supporting

Smart Lie) unded by the Ministry o Education Schoolo Computer Science and Engineering Kyungpook NationalUniversity Korea (983090983089A983090983088983089983091983089983094983088983088983088983088983093) and by IC RampD pro-gram o MSIPIIP (983089983088983088983092983092983092983097983092 WiseKB Big data based sel-evolving knowledge base and reasoning platorm)

References

[983089] J-W Son and S-B Park ldquoWeb table discrimination with com-position o rich structural and content inormationrdquo Applied Sof Computing vol 983089983091 no 983089 pp 983092983095ndash983093983095 983090983088983089983091

[983090] DL McCabe ldquoCheating among college and university studentsa north American perspectiverdquo International Journal or Educa-tional Integrity vol 983089 no 983089 pp 983089ndash983089983089 983090983088983088983093

[983091] S Horwitz ldquoIdentiying the semantic and textual differencesbetween two versions o a programrdquo in Proceedings o the ACM SIGPLAN Conerence on Programming Language Design and Implementation pp 983090983091983092ndash983090983092983093 983089983097983097983088

[983092] W Yang ldquoIdentiying syntactic differences between two pro-

gramsrdquo Sofware Practice and Experience vol 983090983089 no 983095 pp 983095983091983097ndash983095983093983093 983089983097983097983089

[983093] J-W Son -G Noh H-J Song and S-B Park ldquoAn applicationor plagiarized source code detection based on a parse treekernelrdquo Engineering Applications o Arti1047297cial Intelligencevol983090983094no 983096 pp 983089983097983089983089ndash983089983097983089983096 983090983088983089983091

[983094] M L Kammer Plagiarism detection in haskell programs using call graph matching [MS thesis] Utrecht University 983090983088983089983089

[983095] D Haussler ldquoConvolution kernels on discrete structuresrdquo echRep UCS-CRL-983097983097-983089983088 University o Caliornia Santa CruzCali USA 983089983097983097983097

[983096] B Scholkop K suda and J-P Vert Kernel Methods inComputational Biology MI Press 983090983088983088983092

[983097] M Collins and N Duffy ldquoConvolution kernels or natural lan-guagerdquo in Advances in Neural Inormation Processing Systemspp 983094983090983093ndash983094983091983090 983090983088983088983089

[983089983088] Gartner P Flach and S Wrobel ldquoOn graph kernels hardnessresults and efficient alternativesrdquo in Proceedings o the 983089983094th Annual Conerence on Learning Teory pp 983089983090983097ndash983089983092983091 August983090983088983088983091

[983089983089] D Hindle ldquoNoun classi1047297cation rom predicate-argument struc-turesrdquo in Proceedings o the 983090983096th Annual Meeting on Association or Computational Linguistics (ACL rsquo983097983088) pp 983090983094983096ndash983090983095983093 Strouds-burg Pa USA June 983089983097983097983088

[983089983090] P Resnik ldquoUsing inormation content to evaluate semanticsimilarity in a taxonomyrdquo in Proceedings o the983089983091th International Joint Conerence on Arti1047297cial Intelligence pp 983092983092983096ndash983092983093983091 983089983097983097983093

[983089983091] B Gipp N Meuschke and C Breitinger ldquoCitation-based pla-giarism detection practicability on a large-scale scienti1047297c cor-pusrdquo Journalo the Association or Inormation Science and ech-nology vol 983094983093 no 983096 pp 983089983093983090983095ndash983089983093983092983088 983090983088983089983092

[983089983092] G Varelas E Voutsakis P Rafopoulou E G Petrakis andE E Milios ldquoSemantic similarity methods in wordnet andtheir application to inormation retrieval on the webrdquo in Pro-ceedings o the 983095th Annual ACM International Workshop on WebInormation and Data Management pp 983089983088ndash983089983094 983090983088983088983093

[983089983093] K Williams H-H Chen andC L GilesldquoClassiying and rank-ing search engine results as potential sources o plagiarismrdquo inProceedings o the ACM Symposium on Document Engineering pp 983097983095ndash983089983088983094 Fort Collins Colo USA September 983090983088983089983092

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1213

983089983090 Mathematical Problems in Engineering

[983089983094] R A Jarvis and E A Patrick ldquoClustering using a similarity measure based on shared near neighborsrdquo IEEE ransactions onComputers vol 983090983090 no 983089983089 pp 983089983088983090983093ndash983089983088983091983092 983089983097983095983091

[983089983095] K J Ottenstein ldquoAn algorithmic approach to the detection andprevention o plagiarismrdquo ACM SIGCSE Bulletin vol 983096 no 983092pp 983091983088ndash983092983089 983089983097983095983094

[983089983096] M Halstead Elements o Sofware Science Elsevier 983089983097983095983095[983089983097] I D Baxter A Yahin L Moura M SantrsquoAnna and L Bier

ldquoClone detection using abstract syntax treesrdquo in Proceedingso the IEEE International Conerence on Sofware Maintenance(ICSM rsquo983097983096) pp 983091983094983096ndash983091983095983095 November 983089983097983097983096

[983090983088] J Ferrante K J Ottenstein and J D Warren ldquoTe programdependence graph and its use in optimizationrdquo ACM ransac-tions on Programming Languages and Systems vol 983097 no 983091 pp983091983089983097ndash983091983092983097 983089983097983096983095

[983090983089] C Liu C Chen J Han and P S Yu ldquoGplag detection o sofware plagiarism by program dependence graph analysisrdquo inProceedings o the 983089983090th ACM SIGKDD International Conerenceon Knowledge Discovery and Data Mining pp 983096983095983090ndash983096983096983089 983090983088983088983094

[983090983090] H-I Lim H Park S Choi and Han ldquoA method or detecting

the thef o Java programs through analysis o the control 1047298ow inormationrdquo Inormation and Sofware echnology vol 983093983089 no983097 pp 983089983091983091983096ndash983089983091983093983088 983090983088983088983097

[983090983091] D-K Chae J Ha S-W Kim B J Kang and E G ImldquoSofware plagiarism detection a graph-based approachrdquo inProceedings o the 983090983090nd ACM International Conerence onInormation amp Knowledge Management (CIKM rsquo983089983091) pp 983089983093983095983095ndash983089983093983096983088 Burlingame Cali USA November 983090983088983089983091

[983090983092] E Stamatatos ldquoPlagiarism detection using stopword n-gramsrdquo Journal o the American Society or Inormation Science and echnology vol 983094983090 no 983089983090 pp 983090983093983089983090ndash983090983093983090983095 983090983088983089983089

[983090983093] G Cosma and M Joy ldquoAn approach to source-code plagiarismdetection and investigation using latent semantic analysisrdquo

IEEE ransactions on Computers vol 983094983089 no 983091 pp 983091983095983097ndash983091983097983092983090983088983089983090

[983090983094] L Prechelt G Malpohl and M Philippsen ldquoFinding plagia-risms among a set o programs with jplagrdquo Journal o Universal Computer Science vol 983096 no 983089983089 pp 983089983088983089983094ndash983089983088983091983096 983090983088983088983090

[983090983095] A Aiken ldquoMoss a system or detecting sofware plagiarismrdquo983089983097983097983096 httptheorystanordedusimaikenmoss

[983090983096] Kamiya S Kusumoto and K Inoue ldquoCCFinder a multilin-guistic token-based code clone detection system or large scalesource coderdquo IEEE ransactions on Sofware Engineering vol983090983096 no 983095 pp 983094983093983092ndash983094983095983088 983090983088983088983090

[983090983097] J Parr and R W Quong ldquoANLR a predicated-LL(k) parsergeneratorrdquo Sofware Practice and Experience vol 983090983093 no 983095 pp983095983096983097ndash983096983089983088 983089983097983097983093

[983091983088] V N Vapnik Te Nature o StatisticalLearningTeory SpringerNew York NY USA 983089983097983097983093

[983091983089] R Courant and D Hilbert Methods o Mathematical PhysicsInterscience New York NY USA 983089983097983093983091

[983091983090] A Moschitti and F M Zanzotto ldquoFast and effective kernelsor relational learning rom textsrdquo in Proceedings o the 983090983092thInternational Conerence on Machine Learning (ICML rsquo983088983095) pp983094983092983097ndash983094983093983094 Corvallis Ore USA June 983090983088983088983095

[983091983091] M R Garey and D S Johnson Computers and Intractability AGuide to the Teory o NP-Completeness W H Freeman 983089983097983097983088

[983091983092] S V N Vishwanathan N N Schraudolph R Kondor and KM Borgwardt ldquoGraph kernelsrdquo Journal o Machine Learning Research vol 983089983089 pp 983089983090983088983089ndash983089983090983092983090 983090983088983089983088

[983091983093] K M Borgwardt C S Ong S Schonauer S V N Vish-wanathan A J Smola and H-P Kriegel ldquoProtein unction pre-diction via graph kernelsrdquo Bioinormatics vol 983090983089 supplement 983089pp i983092983095ndashi983093983094 983090983088983088983093

[983091983094] N Cristianini and J Shawe-aylor An Introduction to Support Vector Machines Cambridge University Press CambridgeUK983090983088983088983088

[983091983095] J McCabe ldquoA complexity measurerdquo IEEE ransactions onSofware Engineering vol 983090 no 983092 pp 983091983088983096ndash983091983090983088 983089983097983095983094

[983091983096] J Carletta ldquoAssessing agreement on classi1047297cation tasks thekappa statisticrdquo Computational Linguistics vol 983090983090 no 983090 pp983090983092983097ndash983090983093983092 983089983097983097983094

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1313

Submit your manuscripts at

httpwwwhindawicom

Page 12: journal for scribd

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1213

983089983090 Mathematical Problems in Engineering

[983089983094] R A Jarvis and E A Patrick ldquoClustering using a similarity measure based on shared near neighborsrdquo IEEE ransactions onComputers vol 983090983090 no 983089983089 pp 983089983088983090983093ndash983089983088983091983092 983089983097983095983091

[983089983095] K J Ottenstein ldquoAn algorithmic approach to the detection andprevention o plagiarismrdquo ACM SIGCSE Bulletin vol 983096 no 983092pp 983091983088ndash983092983089 983089983097983095983094

[983089983096] M Halstead Elements o Sofware Science Elsevier 983089983097983095983095[983089983097] I D Baxter A Yahin L Moura M SantrsquoAnna and L Bier

ldquoClone detection using abstract syntax treesrdquo in Proceedingso the IEEE International Conerence on Sofware Maintenance(ICSM rsquo983097983096) pp 983091983094983096ndash983091983095983095 November 983089983097983097983096

[983090983088] J Ferrante K J Ottenstein and J D Warren ldquoTe programdependence graph and its use in optimizationrdquo ACM ransac-tions on Programming Languages and Systems vol 983097 no 983091 pp983091983089983097ndash983091983092983097 983089983097983096983095

[983090983089] C Liu C Chen J Han and P S Yu ldquoGplag detection o sofware plagiarism by program dependence graph analysisrdquo inProceedings o the 983089983090th ACM SIGKDD International Conerenceon Knowledge Discovery and Data Mining pp 983096983095983090ndash983096983096983089 983090983088983088983094

[983090983090] H-I Lim H Park S Choi and Han ldquoA method or detecting

the thef o Java programs through analysis o the control 1047298ow inormationrdquo Inormation and Sofware echnology vol 983093983089 no983097 pp 983089983091983091983096ndash983089983091983093983088 983090983088983088983097

[983090983091] D-K Chae J Ha S-W Kim B J Kang and E G ImldquoSofware plagiarism detection a graph-based approachrdquo inProceedings o the 983090983090nd ACM International Conerence onInormation amp Knowledge Management (CIKM rsquo983089983091) pp 983089983093983095983095ndash983089983093983096983088 Burlingame Cali USA November 983090983088983089983091

[983090983092] E Stamatatos ldquoPlagiarism detection using stopword n-gramsrdquo Journal o the American Society or Inormation Science and echnology vol 983094983090 no 983089983090 pp 983090983093983089983090ndash983090983093983090983095 983090983088983089983089

[983090983093] G Cosma and M Joy ldquoAn approach to source-code plagiarismdetection and investigation using latent semantic analysisrdquo

IEEE ransactions on Computers vol 983094983089 no 983091 pp 983091983095983097ndash983091983097983092983090983088983089983090

[983090983094] L Prechelt G Malpohl and M Philippsen ldquoFinding plagia-risms among a set o programs with jplagrdquo Journal o Universal Computer Science vol 983096 no 983089983089 pp 983089983088983089983094ndash983089983088983091983096 983090983088983088983090

[983090983095] A Aiken ldquoMoss a system or detecting sofware plagiarismrdquo983089983097983097983096 httptheorystanordedusimaikenmoss

[983090983096] Kamiya S Kusumoto and K Inoue ldquoCCFinder a multilin-guistic token-based code clone detection system or large scalesource coderdquo IEEE ransactions on Sofware Engineering vol983090983096 no 983095 pp 983094983093983092ndash983094983095983088 983090983088983088983090

[983090983097] J Parr and R W Quong ldquoANLR a predicated-LL(k) parsergeneratorrdquo Sofware Practice and Experience vol 983090983093 no 983095 pp983095983096983097ndash983096983089983088 983089983097983097983093

[983091983088] V N Vapnik Te Nature o StatisticalLearningTeory SpringerNew York NY USA 983089983097983097983093

[983091983089] R Courant and D Hilbert Methods o Mathematical PhysicsInterscience New York NY USA 983089983097983093983091

[983091983090] A Moschitti and F M Zanzotto ldquoFast and effective kernelsor relational learning rom textsrdquo in Proceedings o the 983090983092thInternational Conerence on Machine Learning (ICML rsquo983088983095) pp983094983092983097ndash983094983093983094 Corvallis Ore USA June 983090983088983088983095

[983091983091] M R Garey and D S Johnson Computers and Intractability AGuide to the Teory o NP-Completeness W H Freeman 983089983097983097983088

[983091983092] S V N Vishwanathan N N Schraudolph R Kondor and KM Borgwardt ldquoGraph kernelsrdquo Journal o Machine Learning Research vol 983089983089 pp 983089983090983088983089ndash983089983090983092983090 983090983088983089983088

[983091983093] K M Borgwardt C S Ong S Schonauer S V N Vish-wanathan A J Smola and H-P Kriegel ldquoProtein unction pre-diction via graph kernelsrdquo Bioinormatics vol 983090983089 supplement 983089pp i983092983095ndashi983093983094 983090983088983088983093

[983091983094] N Cristianini and J Shawe-aylor An Introduction to Support Vector Machines Cambridge University Press CambridgeUK983090983088983088983088

[983091983095] J McCabe ldquoA complexity measurerdquo IEEE ransactions onSofware Engineering vol 983090 no 983092 pp 983091983088983096ndash983091983090983088 983089983097983095983094

[983091983096] J Carletta ldquoAssessing agreement on classi1047297cation tasks thekappa statisticrdquo Computational Linguistics vol 983090983090 no 983090 pp983090983092983097ndash983090983093983092 983089983097983097983094

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1313

Submit your manuscripts at

httpwwwhindawicom

Page 13: journal for scribd

8202019 journal for scribd

httpslidepdfcomreaderfulljournal-for-scribd 1313

Submit your manuscripts at

httpwwwhindawicom