CHAPTER 7
SOURCE CODE ANALYSIS USING
DISCRETE WAVELET TRANSFORM
7.1 INTRODUCTION
In this chapter, the use of the discrete wavelet transform (DWT) for source code analysis and for detecting plagiarism in source code files is discussed in detail. The two-phase approach discussed in chapter 5 is modified by replacing LSA with DWT in phase 1 to detect plagiarism in student programming assignments. In the modified two-phase approach, DWT is employed in the first phase to identify distinct clusters of potentially plagiarized files. In the second phase, AST matching is used to identify the plagiarized files. The comparison of files in the second phase has to be done only within the clusters rather than across them, which considerably reduces the computation involved. This two-phase approach is found to be more effective than applying the algorithms independently and also performs better than the combined LSI-AST approach.
7.2 WHY DISCRETE WAVELET TRANSFORM?
Apart from LSI, PCA was tested and ICA was analyzed for use in phase 1. However, PCA failed to give satisfactory results when tested on source code files taken as a whole. PCA or ICA can be used effectively to detect similarities only in code fragments or blocks such as methods and loops.
7.2.1 Principal Component Analysis
Principal component analysis is a two-mode factor analysis technique. In LSI, the term-document matrix is decomposed using SVD. In PCA, the original data matrix is preprocessed to obtain mean-centered data. The data is mapped to a new multi-dimensional space, and the axes, or principal directions, are derived such that the variance of the data is maximal along these directions. The first axis corresponds to the direction of largest variation, the second axis to the direction of second largest variation, and so on. The principal directions of the data matrix are the eigenvectors of its covariance matrix, and the variances along the new axes are given by the corresponding eigenvalues. Standard PCA is a non-parametric analysis, while kernel PCA, applied to non-linearly distributed data, is a parametric analysis. PCA gives good results only when the data is approximately normally distributed and fails for non-Gaussian distributions. In such cases, ICA gives better results.
7.2.2 Independent Component Analysis
Independent component analysis is a blind signal separation technique that
separates a set of input signals into statistically independent components. That is, ICA
extracts signals that are mutually independent of one another unlike LSI which
focuses on signals that are simply decorrelated. ICA decomposes a source matrix into
two new matrices. One of the matrices describes a number of independent
components, and the other is a mixing matrix that holds information about how the
independent components themselves are combined to produce the original set of
mixed signals.
LSI and PCA find axes which are orthonormal; ICA finds axes which are linearly independent but not necessarily orthonormal. In other words, ICA finds biorthogonal axes. In PCA or ICA, the axes onto which the data are projected are discovered from the data itself, unlike Fourier analysis, where the data are always projected onto sines and cosines of varying amplitudes and frequencies. The projection of the data onto the rotated coordinate system with the newly discovered orthogonal or biorthogonal axes helps to decorrelate the data.
PCA performs a second-order analysis, and the measure used to decorrelate the data is variance; ICA performs a higher-order analysis, and non-Gaussianity (kurtosis) is used as the measure. ICA is based on the fact that maximising non-Gaussianity maximises independence. However, since ICA uses nonlinear optimization, it is computationally very expensive. Moreover, the solution obtained using ICA is not unique since the measure used is dimensionless. FastICA can be used along with Gram-Schmidt-like decorrelation to obtain a set of orthogonal axes with fast convergence.
LSI, PCA and ICA, which use vector space models for documents, suffer from certain limitations. In a vector space model, each vector comprises only the numbers of occurrences of the different terms in a document. No information regarding the position of terms in the document is retained. Therefore, in most cases, vector-distance calculation alone will not be sufficient to decide whether there is any sort of plagiarism or similarity in the documents. Moreover, algorithms based on the vector space model that perform well and produce reliable results for text document retrieval may perform very poorly when applied to detect similarities in source code. This discrepancy is due to the peculiarities of a source code file as compared to a normal text document.
7.2.3 Discrete Wavelet Transform
In signal processing applications, wavelet transform allows for localization in
time and frequency domains simultaneously. This property of wavelets has been used
effectively for document retrieval in (Park et al., 2005b). Wavelet transform allows
for retaining information on term position as well as term frequency for each
document.
DWT, even though it uses attribute counting, has proved successful in analyzing patterns and finding similarity in documents, since it can keep track of the positions of the terms involved. The results show that DWT can efficiently be used to trace files which are similar. Unlike other attribute-counting-based techniques, DWT can help figure out portions of files which are identical, since it can analyze the term signals at different resolutions.
In this work, DWT is used to analyze and detect similarities in source code files written in C, C++ and Java. For each language, a list of relevant tokens is maintained, and for each source code file, term signals are generated for each of the tokens present in the file. A term signal pattern is obtained from the different term signals that constitute a file. These term signal patterns are processed further to identify similar files. The term signal pattern for each file is converted into a term spectrum using a spectral (wavelet) transform, and the properties of the term spectra are used to obtain a file score. Files with similar scores are grouped into different sets. Further processing is done on individual sets of files with scores above a threshold.
7.3 MODIFIED PHASE 1 BLOCK DIAGRAM OF THE SYSTEM
In the modified two-phase architecture to detect plagiarisms in source code
files, LSI is replaced with DWT as outlined below.
Phase 1 – Clustering Phase
Input: Large database containing source code files
Algorithm Used: DWT
Output: Clusters of potentially plagiarized files
Phase 2 – Comparison Phase
Input: Output of Phase 1
Algorithm Used: AST matching for files in each group
Output: Plagiarized files
Figure 7.1 shows the modified phase 1 diagram of the system which uses
DWT instead of LSI for source code analysis.
Figure 7.1. Modified Phase 1 Block Diagram
7.3.1 Lexical Analysis
The lexical analysis phase takes as input the set of source code files to be compared. During tokenization, all variable names are replaced with the word VAR, all method names are replaced with the word METHOD, and all constants are replaced with the word NUM. All variable declarations, method declarations, loops, selection statements, normal statements, and statement blocks are identified. The lexical analysis phase also keeps track of the positions (line numbers) of occurrences of the different tokens in each file. Token-position matrices are generated from this information.
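As a sketch of this step, the normalization and position tracking can be expressed as follows. The regular expressions and the keyword list here are simplifying assumptions for C-like code, not the exact implementation; a real system would use a proper lexer with a per-language token list.

```python
import re

# Hypothetical, simplified keyword list; the thesis maintains a full list
# of relevant tokens per language (C, C++, Java).
KEYWORDS = {"int", "float", "char", "if", "else", "while", "do", "for",
            "switch", "return", "void", "include"}

def normalize_and_index(lines):
    """Replace constants with NUM, calls with METHOD, identifiers with VAR,
    and record the line numbers on which each token occurs."""
    kw = "|".join(KEYWORDS)
    positions = {}          # token -> list of 1-based line numbers
    normalized = []
    for lineno, line in enumerate(lines, start=1):
        line = re.sub(r"\b\d+(\.\d+)?\b", "NUM", line)              # constants
        line = re.sub(r"\b(?!(?:%s)\b)[A-Za-z_]\w*(?=\s*\()" % kw,
                      "METHOD", line)                                # method names
        line = re.sub(r"\b(?!(?:%s|NUM|METHOD)\b)[A-Za-z_]\w*\b" % kw,
                      "VAR", line)                                   # variable names
        normalized.append(line)
        for token in re.findall(r"\b\w+\b", line):
            positions.setdefault(token, []).append(lineno)
    return normalized, positions

code = ["int count, total = 0;",
        "while (count <= total) {",
        "total = total + 1;",
        "}"]
norm, pos = normalize_and_index(code)
# norm[0] == "int VAR, VAR = NUM;"; pos records, e.g., while -> [2]
```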
7.3.2 Constructing Token-Position Matrices
Similar to the term-document matrix in LSI, DWT uses token-position matrices which carry information regarding the position and frequency of each token in each document. In LSI, there is only one TDM for the document corpus. In the DWT-based analysis discussed here, there is a token-position matrix for every source code file. Token-position matrices are constructed from the term signals generated for each file.
7.3.2.1 Term Signals and Division of Files into Blocks
A term signal shows the occurrence of a particular token in a particular section
of a file. Term signals are constructed by noting the positions wherever the terms
occur. For analyzing source code files using DWT, term signals are generated for
each token in each file.
Figure 7.2 shows an example of the term occurrence patterns for a term in two files.
Figure 7.2. Illustration of Term Occurrence Patterns for Two Files
Dividing Files into Blocks
Files of varying lengths could be divided into blocks of fixed size but varying number. The block size for a given set of files is decided based on the lengths of the files but is fixed across the set. Small files will then have fewer blocks and large files more blocks, depending on the block size chosen.
Another possible division of a file is based on the minimum file length and
maximum file length obtained from the given set of files. Files are divided into
blocks of varying size but fixed in number for a given set of files. The number
of blocks for a given set is decided based on the lengths of files.
A third type of division is followed in this work. The source code file is
logically divided into different blocks based on the information collected
during lexical analysis phase. These blocks are of varying size and are
scattered across the file.
Figure 7.3 shows the term signals obtained when files are divided into different blocks.

Figure 7.3. Term Signals Obtained When Files are Divided into Blocks: a) Division of Files into Blocks of Fixed Size b) Division of Files into Equal Number of Blocks
Figure 7.3.a shows the term signals obtained when files are divided into fixed
size blocks with each block in File1 and File2 consisting of 5 lines of code. Rather
than using word count as done in (Park et al., 2004), line count is used here to divide
the code into blocks.
Figure 7.3.b shows the term signals obtained when files are divided into equal
number of blocks with each block in File1 consisting of 5 lines and each block of
File2 consisting of 10 lines of code.
Table 7.1 shows the division of the code within a function into blocks based on the file contents, for two code fragments. Lines 3 and 4 (variable declarations) constitute block 1, lines 5 and 6 (simple statements) constitute block 2, and lines 7-10 (loop) constitute block 3 within the function in transformed codes 1 and 2.
Table 7.1 Division of Transformed Code into Blocks

Transformed Code 1:
1. METHOD ( )
2. {
3. INT VAR,VAR=NUM;
4. FLOAT VAR=NUM, VAR=NUM, VAR=NUM;
5. VAR = VAR;
6. VAR = NUM;
7. WHILE(VAR<=VAR){
8. VAR=VAR*(NUM+VAR);
9. VAR = VAR + NUM;
10. }
11. }

Transformed Code 2:
1. METHOD ( )
2. {
3. FLOAT VAR=NUM, VAR=NUM, VAR=NUM;
4. FLOAT VAR,VAR=NUM;
5. VAR = VAR;
6. VAR = NUM;
7. DO{
8. VAR=VAR*(NUM+VAR);
9. VAR = VAR + NUM;
10. }WHILE(VAR<=VAR);
11. }
A term signal is then given by

f_{d,t} = [f_{d,t,0}, f_{d,t,1}, ..., f_{d,t,B-1}]    (7.1)

where f_{d,t,b} is the b-th signal component (0 ≤ b ≤ B-1) of the term signal f_{d,t} for token t in document d with B blocks. In other words, f_{d,t,b} is the frequency (number of occurrences) of token t in block b of document d. Each term signal thus contains information regarding token position as well as token frequency.
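As a minimal illustration of equation 7.1 (assuming the token's line numbers and the block boundaries are already known from lexical analysis), a term signal is simply a count of occurrences per block:

```python
def term_signal(token_lines, blocks):
    """Build f_{d,t} = [f_{d,t,0}, ..., f_{d,t,B-1}]: the number of a
    token's occurrences falling in each block, where blocks is a list of
    (start, end) line ranges, inclusive. Illustrative sketch only."""
    return [sum(1 for ln in token_lines if start <= ln <= end)
            for start, end in blocks]

# Token FLOAT of Transformed Code 2 in table 7.1: it occurs on lines 3 and 4;
# the blocks are lines 3-4 (declarations), 5-6 (statements), 7-10 (loop).
signal = term_signal([3, 4], [(3, 4), (5, 6), (7, 10)])
# signal == [2, 0, 0], matching the token-block matrix entry for FLOAT
```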
Suppose there are n relevant tokens in the file. A term signal pattern is
generated by combining all the term signals for a given file. Token-block matrix Fi
for a file i is obtained by arranging all the term signals for the particular file as:
F_i = [ f_{d_i,t_1,0}   f_{d_i,t_1,1}   ...   f_{d_i,t_1,B-1} ]
      [ f_{d_i,t_2,0}   f_{d_i,t_2,1}   ...   f_{d_i,t_2,B-1} ]
      [       :               :                      :        ]
      [ f_{d_i,t_n,0}   f_{d_i,t_n,1}   ...   f_{d_i,t_n,B-1} ]
If the files are divided into different number of blocks, then the dimensions of
the token-block matrices will differ. If the files are divided into blocks of different
sizes, then smaller files will have blocks with very few lines and larger files will have
blocks with large number of lines.
Token-position or token-block matrix entries for the two sample transformed
codes 1 and 2 in table 7.1 are
Transformed Code 1:
Token   B1  B2  B3
INT      1   0   0
FLOAT    1   0   0
WHILE    0   0   1
DO       0   0   0

Transformed Code 2:
Token   B1  B2  B3
INT      0   0   0
FLOAT    2   0   0
WHILE    0   0   1
DO       0   0   1
Figure 7.4 shows the term signals for INT, FLOAT, WHILE and DO for the two
transformed codes.
Figure 7.4. Term Signals for INT, FLOAT, WHILE and DO for: a) Transformed Code 1 b) Transformed Code 2 in Table 7.1
However, the block representation only gives information about which block a term occurs in and how many times. It does not pinpoint the exact location of each occurrence of a term.
In order to know the exact position of each word, the simplest word vector would contain as many elements as the word count of the document (Park et al., 2004). A '1' would represent the occurrence of a term in a position and a '0' would mark its absence, thereby generating a very large vector. In this work, there is no physical division of the code into blocks. Instead of word count, line count is used to create the term signals. For each token in a file, its occurrences in each line are noted. For each file, therefore, a token-line matrix (a token-block or token-line matrix is, in general, termed a token-position matrix) is generated during the lexical analysis phase.
As seen in table 7.1, the line numbers are noted during the lexical analysis phase. This gives information about the lines in which a particular term occurs. Based on this information, a source code file is divided into four different logical blocks: a declaration section, which includes variable, method, and class declarations; a statement section, which includes variable initializations, arithmetic expressions, and other simple statements; a loop section, which includes for, while and do-while loops; and a selection statement section, which includes if-else and switch statements.
The division is logical since a block may not be physically contiguous (e.g., a block containing variable declarations) and all files may not have all four sections. The line numbers help to keep track of the line in a block on which a term occurred and also to organize the blocks scattered across the file. Term signal patterns are transformed using DWT, and the spectral values of corresponding sections of code can be compared to get document scores. For example, the declaration section of one file needs to be matched only against the declaration section of another file. Thus, it allows selective matching of code.
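The logical division above can be sketched as a per-line classifier. The keyword sets and the first-word rule here are simplifying assumptions for illustration; the thesis derives the sections from the full lexical analysis, not from keywords alone.

```python
# Hypothetical classifier assigning each transformed line to one of the four
# logical sections. A logical block is then the set of line numbers per
# section, which need not be physically contiguous.
LOOP_KW = ("for", "while", "do")
SELECT_KW = ("if", "else", "switch")
DECL_KW = ("int", "float", "char", "double", "class", "void")

def classify_line(line):
    stripped = line.strip().lower()
    first = stripped.split("(")[0].split()[0] if stripped else ""
    if first in LOOP_KW:
        return "loop"
    if first in SELECT_KW:
        return "selection"
    if first in DECL_KW:
        return "declaration"
    return "statement"

def logical_blocks(lines):
    """Map each section name to the (possibly scattered) line numbers."""
    blocks = {}
    for lineno, line in enumerate(lines, start=1):
        blocks.setdefault(classify_line(line), []).append(lineno)
    return blocks

code = ["INT VAR,VAR=NUM;", "VAR = VAR;", "WHILE(VAR<=VAR){", "VAR=VAR+NUM;"]
sections = logical_blocks(code)
# declaration -> [1], statement -> [2, 4], loop -> [3]
```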
7.3.2.2 Normalization of Token-Position Matrices
Term weighting schemes used in document retrieval techniques involving vector space models reduce the impact of document length on the document score. The same applies to term signals. The normalization is done by computing a term frequency-inverse document frequency transform as given in (Thaicharoen et al., 2008):

tfidf_{d,t,b} = (f_{d,t,b} / f_{d,t}) (1 + lg f_{d,t}) lg(n / DF_t)    (7.2)

where n is the total number of documents and DF_t is the document frequency of token t.
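A small numeric sketch of this weighting follows. The exact form of equation 7.2 is reconstructed from context, so this code reflects one plausible reading (per-block frequency scaled by the term's total frequency, a log term-frequency factor, and an inverse-document-frequency factor), with base-2 logarithms assumed for lg:

```python
import math

def tfidf_weight(f_dtb, f_dt, n_docs, df_t):
    """Weight the b-th component of a term signal. Assumed reading of
    equation 7.2: (f_dtb / f_dt) * (1 + lg f_dt) * lg(n / DF_t)."""
    if f_dt == 0 or df_t == 0:
        return 0.0
    return (f_dtb / f_dt) * (1 + math.log2(f_dt)) * math.log2(n_docs / df_t)

# A token occurring twice in one block out of 4 total occurrences, in a
# corpus of 16 files of which 4 contain the token:
w = tfidf_weight(f_dtb=2, f_dt=4, n_docs=16, df_t=4)
# (2/4) * (1 + lg 4) * lg(16/4) = 0.5 * 3 * 2 = 3.0
```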
7.3.3 Applying Discrete Wavelet Transform
DWT is used as an alternative to LSI in phase 1 of the two-phase source code
plagiarism detection system. Different wavelet and scaling filters can be used to
analyze the term signal patterns. These filters transform the term signal patterns into a
different representation, called the term spectra, as already discussed in section 7.2.3.
The wavelet spectrum of term t in document d is given by

ζ_{d,t} = [ζ_{d,t,0}, ζ_{d,t,1}, ..., ζ_{d,t,B-1}]    (7.3)

where ζ_{d,t,b} = H_{d,t,b} e^{i φ_{d,t,b}} is the b-th spectral component of token t in document d with magnitude H_{d,t,b} and phase φ_{d,t,b}.
To perform DWT using the Haar wavelet, the 2×2 Haar wavelet transform matrix is

H_2 = (1/√2) [ 1   1 ]
             [ 1  -1 ]

where the first row represents the low pass filter and the second row represents the high pass filter. Each term signal [f_{d,t,0}, f_{d,t,1}, ..., f_{d,t,B-1}], where B is equal to the line count of document d since there is no division into blocks, is transformed into a sequence of two-component vectors

[f_{d,t,0}, f_{d,t,1}], [f_{d,t,2}, f_{d,t,3}], ..., [f_{d,t,2k-2}, f_{d,t,2k-1}]

if B is even (say, B = 2k). If B is odd, either the last value can be dropped or a zero can be padded to make it an even-length sequence. Right-multiplying each vector with the matrix H_2 gives the result

[s_0, d_0], [s_1, d_1], ..., [s_{k-1}, d_{k-1}]

where the s_i are the smooth coefficients and the d_i are the detail coefficients obtained after the first level of decomposition using the discrete Haar wavelet transform. Smooth coefficients are stored in the left part of the signal and detail coefficients in the right part. Further decomposition is done by applying the Haar scaling and wavelet filters repeatedly to the smooth coefficients.
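A minimal sketch of this decomposition, with zero-padding for odd-length signals as described (the multi-level loop here is a simplified version that stops when the smooth part has odd length):

```python
import math

def haar_level(signal):
    """One level of the normalized Haar transform: each pair (a, b) maps to
    a smooth coefficient (a+b)/sqrt(2) and a detail coefficient (a-b)/sqrt(2).
    Smooth coefficients go to the left half, details to the right half."""
    s = [(a + b) / math.sqrt(2) for a, b in zip(signal[0::2], signal[1::2])]
    d = [(a - b) / math.sqrt(2) for a, b in zip(signal[0::2], signal[1::2])]
    return s + d

def haar_dwt(signal, levels):
    """Multi-level decomposition: reapply the filters to the smooth part."""
    out = list(signal)
    if len(out) % 2:          # odd length: pad a zero
        out.append(0)
    n = len(out)
    for _ in range(levels):
        if n < 2 or n % 2:
            break
        out[:n] = haar_level(out[:n])
        n //= 2
    return out

# Term signal for a token occurring on lines 1 and 2 of a 4-line file:
spectrum = haar_dwt([1, 1, 0, 0], levels=1)
# -> [sqrt(2), 0.0, 0.0, 0.0]: smooth coefficients [sqrt(2), 0], zero details
```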
Consider plagiarized files HelloWorld.c and HelloWorld1.c.
HelloWorld.c:

#include<stdio.h>
main()
{
    printf("Hello World");
}

HelloWorld1.c:

#include<stdio.h>
main()
{
    char c = ' ';
    printf("Hello World");
}
Figure 7.5. Term Signals and their Wavelet Spectra after First Level of Decomposition using Normalized Haar Scaling and Wavelet Filters: a) Term Signals for char and include in Files HelloWorld.c and HelloWorld1.c Respectively b) Magnitude Components of the Term Spectra c) Phase Components of the Term Spectra
Figure 7.5 shows the term signals for char and include in files HelloWorld.c
and HelloWorld1.c and their corresponding wavelet spectra obtained after first level
of decomposition using normalized Haar scaling and wavelet filters.
7.3.4 Computing File Scores
On applying DWT to the weighted term signal patterns, term spectra are obtained. Each term spectrum has magnitude and phase information. Both the magnitude and the phase information of the term spectra are used to compute a score for each file, based on which the clustering of potentially plagiarized files is done.
The magnitude and phase information obtained from the wavelet spectra of the
term signals can be used in different ways to compute the document scores (Park et
al., 2004). Here, magnitude and phase are examined separately. A magnitude vector is
formed by adding the corresponding magnitude components for all the tokens in a
file. A phase precision vector is obtained by first assigning each phase to a unit
vector. The corresponding components of the vectors are added and the magnitude is
averaged to get the phase precision value. If all of the phases are the same, the unit
vectors will add constructively and the resulting magnitude will be 1. If the phases are
scattered, the unit vectors will add destructively and the resulting magnitude will be
close to zero (Park et al., 2005b).
The phase precision Φ_{d,b} of the b-th spectral component in document d is given by

Φ_{d,b} = | Σ_{t∈T} φ̄_{d,t,b} | / #T    (7.4)

where φ̄_{d,t,b} is the unit phase of the b-th spectral component of token t in document d and #T is the number of distinct tokens in document d. φ̄_{d,t,b} is given by

φ̄_{d,t,b} = ζ_{d,t,b} / |ζ_{d,t,b}| = e^{i φ_{d,t,b}}    (7.5)
A score vector is formed from the magnitude and phase precision vectors. The score s_{d,b} of the b-th spectral component in document d is given by

s_{d,b} = Φ_{d,b} Σ_{t∈T} H_{d,t,b}    (7.6)

A single score S_d for document d is obtained by adding the components of the score vector:

S_d = Σ_{b=0}^{B-1} s_{d,b}    (7.7)
For each file, scores are also calculated separately for the different sections
and these are compared against corresponding scores of other documents. Clustering
is done based on these two results.
The features of term signal patterns can directly be used to compute the file
scores based on which the files can be clustered. However, DWT helps to analyze the
signal patterns at different resolutions. This capability of DWT is utilized here and
hence, a more reliable score is computed using the transformed signal patterns.
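Treating the spectral components as complex numbers, the score computation of equations 7.4-7.7 can be sketched as follows. This is a simplified reading for a single whole-file score; the thesis additionally computes per-section scores:

```python
def file_score(spectra):
    """Combine magnitude and phase precision into a single file score S_d.
    spectra maps each token to its list of complex spectral components
    (all lists of equal length B)."""
    tokens = list(spectra)
    B = len(spectra[tokens[0]])
    score = 0.0
    for b in range(B):
        comps = [spectra[t][b] for t in tokens]
        # Phase precision: magnitude of the summed unit phase vectors,
        # averaged over tokens (1 if all phases agree, near 0 if scattered).
        units = [c / abs(c) for c in comps if abs(c) > 0]
        phase_prec = abs(sum(units)) / len(tokens) if units else 0.0
        magnitude = sum(abs(c) for c in comps)   # sum of H_{d,t,b}
        score += phase_prec * magnitude          # s_{d,b}
    return score                                 # S_d

# Two tokens with perfectly aligned phases in a single spectral component:
s = file_score({"VAR": [1 + 1j], "NUM": [2 + 2j]})
# phase precision 1; magnitudes sqrt(2) + 2*sqrt(2) -> score 3*sqrt(2)
```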
7.3.5 Processing of File Clusters in Phase 2
Phase 2 processes the file clusters obtained from phase 1. Here, the
comparison is done within the clusters rather than across the clusters based on the file
scores. ASTs are generated for files with similar scores or files identified as
potentially plagiarized in phase 1 and a preorder traversal is done to obtain the
corresponding node sequences. Sequence matching algorithms are then applied to
identify the plagiarized files. AST matching is used successfully to identify
plagiarized files with high precision and recall. It can figure out the portions of code
which are plagiarized.
7.4 IMPLEMENTATION AND TESTING
7.4.1 Clustering using DWT
In vector space analysis, each file is treated as a point in the n-dimensional
term space.
Consider two pairs of plagiarized Java files (from Database CEN – Java (Set
1)). DateandTime1.java and DateandTime2.java form a pair of plagiarized files.
sorting3.java and sorting4.java form another pair of plagiarized files. Considering the
number of occurrences of all the tokens in each file as given in table 7.2, each file is
represented as a discrete signal in figure 7.6.
Table 7.2 Term Frequencies for Four Java Files
Term      DateandTime1.java  DateandTime2.java  sorting3.java  sorting4.java
boolean           1                  0                1              1
catch             3                  3                0              0
class             1                  1                1              1
else              2                  0                2              2
FALSE             3                  2                1              1
for               0                  0               15             15
if                4                  0                7              7
import            5                  5                0              0
int               4                  3               11             11
new               7                  6                3              3
null              3                  3                0              0
public           10                  8                7              7
return            2                  0                0              0
static            9                  7                6              6
TRUE              1                  0                2              2
try               2                  2                0              0
void              8                  7                6              6
while             0                  0                4              4
Figure 7.6. Discrete Signal Representation for Two Pairs of Plagiarized Java
Files
In DWT based analysis, rather than treating each file as a signal, each file is
considered as a collection of term signals. Figure 7.7 shows the term signal patterns
for two pairs of plagiarized Java source code files.
Figure 7.7. Term Signal Patterns for Two Pairs of Plagiarized Java Files: a) DateandTime1.java b) DateandTime2.java c) sorting3.java d) sorting4.java
It can be noted from figure 7.7 that the term signal patterns for the plagiarized
files are identical. Figure 7.8 shows the term spectra obtained after four levels of
decomposition using Daubechies 4-tap scaling and wavelet filters.
Figure 7.8. Term Spectra Obtained after Four Levels of Decomposition using Daubechies 4-Tap Scaling and Wavelet Functions: a) DateandTime1.java b) DateandTime2.java c) sorting3.java d) sorting4.java
Figure 7.9 shows the clustering of plagiarized files: 1) DateandTime1.java 2)
DateandTime2.java 3) sorting3.java 4) sorting4.java. File scores can be computed in
different ways. Here, phase is also taken into consideration to calculate the score.
Figure 7.9. Clustering of Two Pairs of Plagiarized Java Files
Consider a set of 9 files consisting of 4 sets of plagiarized C files as given in
table 7.3. Grouping of these files based on the file scores obtained using equation 7.7
after first level decomposition using unnormalized Haar scaling and wavelet filters is
shown in figure 7.10.
Table 7.3 C Fileset 1 (Database CEN – C)
Set     File No.   File Name
Set 1      1       DerivativesThreePoint.c
           2       DerivativesThreePoint_T3.c
Set 2      3       amountFunction.c
           4       amountFunction_T3.c
Set 3      5       binToHex.c
           6       binToHex_T1.c
           7       binToHex_T3.c
Set 4      8       calculator.c
           9       calculator_T3.c
Figure 7.10. Plagiarized Files in C Fileset 1 Grouped into Clusters Based on File Scores Obtained after First Level Decomposition using Unnormalized Haar Scaling and Wavelet Filters
The plagiarized files have similar scores and thus can be grouped into distinct clusters. Using LSI, file-file similarity scores (for all possible pairs of files in the database) are obtained, which make it possible to identify the extent to which two given programs in the database are similar. DWT allows clusters of similar files to be formed. When there are a large number of files, a single cluster will contain files of varying functionality. However, the plagiarized files are grouped into the same cluster. A cluster may contain different sets of plagiarized files with similar file scores. It may also contain files of different functionality which are not plagiarized. These are identified only after a file-to-file comparison using AST matching.
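This grouping step can be sketched as clustering files whose scores lie close together. The greedy gap rule and the tolerance value here are assumptions for illustration:

```python
def cluster_by_score(scores, tol=0.05):
    """Group file names whose scores are within tol of each other.
    Greedy single-pass sketch: sort by score and start a new cluster
    whenever the gap to the previous file exceeds tol."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1])
    clusters, current = [], []
    prev = None
    for name, s in ordered:
        if prev is not None and s - prev > tol:
            clusters.append(current)
            current = []
        current.append(name)
        prev = s
    if current:
        clusters.append(current)
    return clusters

# Hypothetical file scores for two of the plagiarized pairs in table 7.3:
files = {"binToHex.c": 3.10, "binToHex_T3.c": 3.12,
         "calculator.c": 7.40, "calculator_T3.c": 7.43}
groups = cluster_by_score(files)
# -> [['binToHex.c', 'binToHex_T3.c'], ['calculator.c', 'calculator_T3.c']]
```

Files within each resulting cluster are then handed to phase 2 for pairwise AST matching.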
7.4.2 System Evaluation and Performance
A comparison of the running times of the combined approach employing DWT and AST and of the combined two-phase approach employing LSI and AST, when used to detect similarities in source code files written in C, is shown in figure 7.11.
Figure 7.11. Performance Plots of Combined LSI and AST and Combined DWT
and AST Approaches
The use of Daub-4 wavelet in phase 1 has given better results than the use of
Haar wavelet. It can be noted from figure 7.12 that the precision and recall on using
Daub-4 wavelet in phase 1 is better as compared to that of LSI or Haar wavelet.
Figure 7.12. Precision and Recall Curves Obtained on Applying LSI, Haar Wavelet Transform, and Daub-4 Wavelet Transform in Phase 1: a) Precision b) Recall
7.5 CONCLUSIONS
The two-phase architecture for source code plagiarism detection using DWT
and AST matching has proved to be more efficient than that using LSI and AST
matching. DWT is used only for an initial screening of files. Though more efficient
than LSA, the efficiency and performance of DWT based source code analysis highly
depends on the selection of wavelet and the level of decomposition.