CHAPTER 7
SOURCE CODE ANALYSIS USING
DISCRETE WAVELET TRANSFORM
7.1 INTRODUCTION
In this chapter, the use of the discrete wavelet transform (DWT) for source code analysis and for detecting plagiarism in source code files is discussed in detail. The two-phase approach discussed in chapter 5 is modified by replacing LSA with DWT in phase 1 to detect plagiarism in student programming assignments. In the modified two-phase approach, DWT is employed in the first phase to identify distinct clusters of potentially plagiarized files. In the second phase, AST matching is used to identify the plagiarized files. The comparison of files in the second phase has to be done only within the clusters rather than across them, which considerably reduces the computation involved. This two-phase approach is found to be more effective than applying the algorithms independently and also performs better than the combined LSI-AST approach.
7.2 WHY DISCRETE WAVELET TRANSFORM?
Apart from LSI, PCA was tested and ICA was analyzed for use in phase 1. However, PCA failed to give satisfactory results when tested on source code files taken as a whole. PCA or ICA can be used effectively to detect similarities only in code fragments or blocks such as methods and loops.
7.2.1 Principal Component Analysis
Principal component analysis is a two-mode factor analysis technique. In LSI, the term-document matrix is decomposed using SVD. In PCA, the original data matrix is preprocessed to obtain mean-centered data. The data is mapped to a new multi-dimensional space, and the axes, or principal directions, are derived such that the variance of the data is maximal along these directions. The first axis corresponds to the direction of largest variation, the second axis to the direction of second largest variation, and so on. The principal directions of the data matrix are the eigenvectors of its covariance matrix, and the variances along the new axes are given by the corresponding eigenvalues. Standard PCA is a non-parametric analysis, while kernel PCA, applied to non-linearly distributed data, is a parametric analysis. PCA gives good results only when the data is approximately normally distributed and fails for non-Gaussian distributions. In such cases, ICA gives better results.
7.2.2 Independent Component Analysis
Independent component analysis is a blind signal separation technique that
separates a set of input signals into statistically independent components. That is, ICA
extracts signals that are mutually independent of one another unlike LSI which
focuses on signals that are simply decorrelated. ICA decomposes a source matrix into
two new matrices. One of the matrices describes a number of independent
components, and the other is a mixing matrix that holds information about how the
independent components themselves are combined to produce the original set of
mixed signals.
LSI and PCA find axes which are orthonormal; ICA finds axes which are linearly independent but not necessarily orthonormal. In other words, ICA finds biorthogonal axes. In PCA or ICA, the axes onto which the data are projected are discovered from the data itself, unlike Fourier analysis, where the data are always projected onto sines and cosines of varying amplitudes and frequencies. The projection of the data onto the rotated coordinate system with the newly discovered orthogonal or biorthogonal axes helps to decorrelate the data.
PCA performs a second-order analysis, and the measure used to decorrelate the data is variance; ICA performs a higher-order analysis, and non-Gaussianity (kurtosis) is used as the measure. ICA is based on the fact that maximising non-Gaussianity maximises independence. However, since ICA uses nonlinear optimization, it is computationally very expensive. Moreover, the solution obtained using ICA is not unique since the measure used is dimensionless. FastICA can be used along with Gram-Schmidt-like decorrelation to obtain a set of orthogonal axes with fast convergence.
LSI, PCA and ICA, which use vector space models for documents, suffer from certain limitations. In a vector space model, each vector comprises only the numbers of occurrences of the different terms in a document. No information regarding the position of terms in the document is retained. Therefore, in most cases, vector-distance calculation alone will not be sufficient to decide whether there is any sort of plagiarism or similarity in the documents. Moreover, algorithms based on the vector space model that perform well and produce reliable results for text document retrieval may perform very poorly when applied to detect similarities in source code. This discrepancy is due to the peculiarities of a source code file as compared to a normal text document.
7.2.3 Discrete Wavelet Transform
In signal processing applications, wavelet transform allows for localization in
time and frequency domains simultaneously. This property of wavelets has been used
effectively for document retrieval in (Park et al., 2005b). Wavelet transform allows
for retaining information on term position as well as term frequency for each
document.
DWT, even though it uses attribute counting, has proved successful in analyzing patterns and finding similarity in documents, since it can keep track of the positions of the terms involved. The results show that DWT can efficiently be used to trace files which are similar. Unlike other attribute-counting-based techniques, DWT can help figure out portions of files which are identical, since it can analyze the term signals at different resolutions.
In this work, DWT is used to analyze and detect similarities in source code files written in C, C++ and Java. For each language, a list of relevant tokens is maintained, and for each source code file, term signals are generated for each of the tokens present in the file. A term signal pattern is obtained from the different term signals that constitute a file. These term signal patterns are processed further to identify similar files. The term signal pattern for each file is converted into a term spectrum using a spectral (wavelet) transform, and the properties of the term spectra are used to obtain a file score. Files with similar scores are grouped into different sets. Further processing is done on individual sets of files with scores above a threshold.
7.3 MODIFIED PHASE 1 BLOCK DIAGRAM OF THE SYSTEM
In the modified two-phase architecture to detect plagiarisms in source code
files, LSI is replaced with DWT as outlined below.
Phase 1 – Clustering Phase
Input: Large database containing source code files
Algorithm Used: DWT
Output: Clusters of potentially plagiarized files
Phase 2 – Comparison Phase
Input: Output of Phase 1
Algorithm Used: AST matching for files in each group
Output: Plagiarized files
Figure 7.1 shows the modified phase 1 diagram of the system which uses
DWT instead of LSI for source code analysis.
Figure 7.1. Modified Phase 1 Block Diagram
7.3.1 Lexical Analysis
The lexical analysis phase takes as input the set of source code files to be compared. During tokenization, all variable names are replaced with the word VAR, all method names are replaced with the word METHOD, and all constants are replaced with the word NUM. All variable declarations, method declarations, loops, selection statements, normal statements, and statement blocks are identified. The lexical analysis phase also keeps track of the positions (line numbers) of occurrences of the different tokens in each file. Token-position matrices are generated from this information.
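As a sketch of this step, the normalization and position tracking can be expressed as follows. The regular expressions and the keyword list here are simplifying assumptions for C-like code, not the exact implementation; a real system would use a proper lexer with a per-language token list.

```python
import re

# Hypothetical, simplified keyword list; the thesis maintains a full list
# of relevant tokens per language (C, C++, Java).
KEYWORDS = {"int", "float", "char", "if", "else", "while", "do", "for",
            "switch", "return", "void", "include"}

def normalize_and_index(lines):
    """Replace constants with NUM, calls with METHOD, identifiers with VAR,
    and record the line numbers on which each token occurs."""
    kw = "|".join(KEYWORDS)
    positions = {}          # token -> list of 1-based line numbers
    normalized = []
    for lineno, line in enumerate(lines, start=1):
        line = re.sub(r"\b\d+(\.\d+)?\b", "NUM", line)              # constants
        line = re.sub(r"\b(?!(?:%s)\b)[A-Za-z_]\w*(?=\s*\()" % kw,
                      "METHOD", line)                                # method names
        line = re.sub(r"\b(?!(?:%s|NUM|METHOD)\b)[A-Za-z_]\w*\b" % kw,
                      "VAR", line)                                   # variable names
        normalized.append(line)
        for token in re.findall(r"\b\w+\b", line):
            positions.setdefault(token, []).append(lineno)
    return normalized, positions

code = ["int count, total = 0;",
        "while (count <= total) {",
        "total = total + 1;",
        "}"]
norm, pos = normalize_and_index(code)
# norm[0] == "int VAR, VAR = NUM;"; pos records, e.g., while -> [2]
```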
7.3.2 Constructing Token-Position Matrices
Similar to the term-document matrix in LSI, DWT uses token-position matrices which carry information regarding the position and frequency of each token in each document. In LSI, there is only one TDM for the document corpus. In the DWT-based analysis discussed here, there is a token-position matrix for every source code file. Token-position matrices are constructed from the term signals generated for each file.
7.3.2.1 Term Signals and Division of Files into Blocks
A term signal shows the occurrence of a particular token in a particular section
of a file. Term signals are constructed by noting the positions wherever the terms
occur. For analyzing source code files using DWT, term signals are generated for
each token in each file.
Figure 7.2 shows an example of the term occurrence patterns for a term in two files.
Figure 7.2. Illustration of Term Occurrence Patterns for Two Files
Dividing Files into Blocks
Files of varying lengths could be divided into blocks of fixed size but varying number. The block size for a given set of files is decided based on the lengths of the files but is fixed across the set. Small files will then have fewer blocks and large files more blocks, depending on the block size chosen.
Another possible division of a file is based on the minimum file length and
maximum file length obtained from the given set of files. Files are divided into
blocks of varying size but fixed in number for a given set of files. The number
of blocks for a given set is decided based on the lengths of files.
A third type of division is followed in this work. The source code file is
logically divided into different blocks based on the information collected
during lexical analysis phase. These blocks are of varying size and are
scattered across the file.
Figure 7.3 shows the term signals obtained when files are divided into different blocks.

Figure 7.3. Term Signals Obtained When Files are Divided into Blocks: a) Division of Files into Blocks of Fixed Size b) Division of Files into Equal Number of Blocks
Figure 7.3.a shows the term signals obtained when files are divided into fixed
size blocks with each block in File1 and File2 consisting of 5 lines of code. Rather
than using word count as done in (Park et al., 2004), line count is used here to divide
the code into blocks.
Figure 7.3.b shows the term signals obtained when files are divided into equal
number of blocks with each block in File1 consisting of 5 lines and each block of
File2 consisting of 10 lines of code.
Table 7.1 shows the division of the code within a function into blocks based on the file contents, for two code fragments. Lines 3 and 4 (variable declarations) constitute block 1, lines 5 and 6 (simple statements) constitute block 2, and lines 7-10 (loop) constitute block 3 within the function in transformed codes 1 and 2.
Table 7.1 Division of Transformed Code into Blocks

Transformed Code 1:
1. METHOD ( )
2. {
3. INT VAR,VAR=NUM;
4. FLOAT VAR=NUM, VAR=NUM, VAR=NUM;
5. VAR = VAR;
6. VAR = NUM;
7. WHILE(VAR<=VAR){
8. VAR=VAR*(NUM+VAR);
9. VAR = VAR + NUM;
10. }
11. }

Transformed Code 2:
1. METHOD ( )
2. {
3. FLOAT VAR=NUM, VAR=NUM, VAR=NUM;
4. FLOAT VAR,VAR=NUM;
5. VAR = VAR;
6. VAR = NUM;
7. DO{
8. VAR=VAR*(NUM+VAR);
9. VAR = VAR + NUM;
10. }WHILE(VAR<=VAR);
11. }
A term signal is then given by

f_{d,t} = [f_{d,t,0}, f_{d,t,1}, ..., f_{d,t,B-1}]    (7.1)

where f_{d,t,b} is the b-th signal component (0 ≤ b ≤ B-1) of the term signal f_{d,t} for token t in document d with B blocks. In other words, f_{d,t,b} is the frequency (number of occurrences) of token t in block b of document d. Each term signal thus contains information regarding token position as well as token frequency.
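As a minimal illustration of equation 7.1 (assuming the token's line numbers and the block boundaries are already known from lexical analysis), a term signal is simply a count of occurrences per block:

```python
def term_signal(token_lines, blocks):
    """Build f_{d,t} = [f_{d,t,0}, ..., f_{d,t,B-1}]: the number of a
    token's occurrences falling in each block, where blocks is a list of
    (start, end) line ranges, inclusive. Illustrative sketch only."""
    return [sum(1 for ln in token_lines if start <= ln <= end)
            for start, end in blocks]

# Token FLOAT of Transformed Code 2 in table 7.1: it occurs on lines 3 and 4;
# the blocks are lines 3-4 (declarations), 5-6 (statements), 7-10 (loop).
signal = term_signal([3, 4], [(3, 4), (5, 6), (7, 10)])
# signal == [2, 0, 0], matching the token-block matrix entry for FLOAT
```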
Suppose there are n relevant tokens in the file. A term signal pattern is
generated by combining all the term signals for a given file. Token-block matrix Fi
for a file i is obtained by arranging all the term signals for the particular file as:
F_i = [ f_{d_i,t_1,0}   f_{d_i,t_1,1}   ...   f_{d_i,t_1,B-1} ]
      [ f_{d_i,t_2,0}   f_{d_i,t_2,1}   ...   f_{d_i,t_2,B-1} ]
      [       :               :                      :        ]
      [ f_{d_i,t_n,0}   f_{d_i,t_n,1}   ...   f_{d_i,t_n,B-1} ]
If the files are divided into different number of blocks, then the dimensions of
the token-block matrices will differ. If the files are divided into blocks of different
sizes, then smaller files will have blocks with very few lines and larger files will have
blocks with large number of lines.
Token-position or token-block matrix entries for the two sample transformed
codes 1 and 2 in table 7.1 are
Transformed Code 1:
Token   B1  B2  B3
INT      1   0   0
FLOAT    1   0   0
WHILE    0   0   1
DO       0   0   0

Transformed Code 2:
Token   B1  B2  B3
INT      0   0   0
FLOAT    2   0   0
WHILE    0   0   1
DO       0   0   1
Figure 7.4 shows the term signals for INT, FLOAT, WHILE and DO for the two
transformed codes.
Figure 7.4. Term Signals for INT, FLOAT, WHILE and DO for: a) Transformed Code 1 b) Transformed Code 2 in Table 7.1
However, the block representation only gives information about which block a term occurs in and how many times. It does not pinpoint the exact location of each occurrence of a term.
In order to know the exact position of each word, the simplest word vector would contain as many elements as the word count of the document (Park et al., 2004). A '1' would represent the occurrence of a term in a position and a '0' would mark its absence, thereby generating a very large vector. In this work, there is no physical division of the code into blocks. Instead of word count, line count is used to create the term signals. For each token in a file, its occurrences in each line are noted. For each file, therefore, a token-line matrix (a token-block or token-line matrix is, in general, termed a token-position matrix) is generated during the lexical analysis phase.
As seen in table 7.1, the line numbers are noted during the lexical analysis phase. This gives information about the lines in which a particular term occurs. Based on this information, a source code file is divided into four different logical blocks: a declaration section, which includes variable, method, and class declarations; a statement section, which includes variable initializations, arithmetic expressions, and other simple statements; a loop section, which includes for, while and do-while loops; and a selection statement section, which includes if-else and switch statements.
The division is logical since a block may not be physically contiguous (e.g., a block containing variable declarations) and all files may not have all four sections. The line numbers help to keep track of the line in a block on which a term occurred and also to organize the blocks scattered across the file. Term signal patterns are transformed using DWT, and the spectral values of corresponding sections of code can be compared to get document scores. For example, the declaration section of one file needs to be matched only against the declaration section of another file. Thus, it allows selective matching of code.
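The logical division above can be sketched as a per-line classifier. The keyword sets and the first-word rule here are simplifying assumptions for illustration; the thesis derives the sections from the full lexical analysis, not from keywords alone.

```python
# Hypothetical classifier assigning each transformed line to one of the four
# logical sections. A logical block is then the set of line numbers per
# section, which need not be physically contiguous.
LOOP_KW = ("for", "while", "do")
SELECT_KW = ("if", "else", "switch")
DECL_KW = ("int", "float", "char", "double", "class", "void")

def classify_line(line):
    stripped = line.strip().lower()
    first = stripped.split("(")[0].split()[0] if stripped else ""
    if first in LOOP_KW:
        return "loop"
    if first in SELECT_KW:
        return "selection"
    if first in DECL_KW:
        return "declaration"
    return "statement"

def logical_blocks(lines):
    """Map each section name to the (possibly scattered) line numbers."""
    blocks = {}
    for lineno, line in enumerate(lines, start=1):
        blocks.setdefault(classify_line(line), []).append(lineno)
    return blocks

code = ["INT VAR,VAR=NUM;", "VAR = VAR;", "WHILE(VAR<=VAR){", "VAR=VAR+NUM;"]
sections = logical_blocks(code)
# declaration -> [1], statement -> [2, 4], loop -> [3]
```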
7.3.2.2 Normalization of Token-Position Matrices
Term weighting schemes used in document retrieval techniques involving vector space models reduce the impact of document length on the document score. The same applies to term signals. The normalization is done by computing a term frequency-inverse document frequency transform as given in (Thaicharoen et al., 2008):

tfidf_{d,t,b} = (f_{d,t,b} / f_{d,t}) (1 + lg f_{d,t}) lg(n / DF_t)    (7.2)

where n is the total number of documents and DF_t is the document frequency of token t.
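A small numeric sketch of this weighting follows. The exact form of equation 7.2 is reconstructed from context, so this code reflects one plausible reading (per-block frequency scaled by the term's total frequency, a log term-frequency factor, and an inverse-document-frequency factor), with base-2 logarithms assumed for lg:

```python
import math

def tfidf_weight(f_dtb, f_dt, n_docs, df_t):
    """Weight the b-th component of a term signal. Assumed reading of
    equation 7.2: (f_dtb / f_dt) * (1 + lg f_dt) * lg(n / DF_t)."""
    if f_dt == 0 or df_t == 0:
        return 0.0
    return (f_dtb / f_dt) * (1 + math.log2(f_dt)) * math.log2(n_docs / df_t)

# A token occurring twice in one block out of 4 total occurrences, in a
# corpus of 16 files of which 4 contain the token:
w = tfidf_weight(f_dtb=2, f_dt=4, n_docs=16, df_t=4)
# (2/4) * (1 + lg 4) * lg(16/4) = 0.5 * 3 * 2 = 3.0
```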
7.3.3 Applying Discrete Wavelet Transform
DWT is used as an alternative to LSI in phase 1 of the two-phase source code
plagiarism detection system. Different wavelet and scaling filters can be used to
analyze the term signal patterns. These filters transform the term signal patterns into a
different representation, called the term spectra, as already discussed in section 7.2.3.
The wavelet spectrum of term t in document d is given by

ζ_{d,t} = [ζ_{d,t,0}, ζ_{d,t,1}, ..., ζ_{d,t,B-1}]    (7.3)

where ζ_{d,t,b} = H_{d,t,b} e^{i φ_{d,t,b}} is the b-th spectral component of token t in document d with magnitude H_{d,t,b} and phase φ_{d,t,b}.
To perform DWT using the Haar wavelet, the 2×2 Haar wavelet transform matrix is

H_2 = (1/√2) [ 1   1 ]
             [ 1  -1 ]

where the first row represents the low pass filter and the second row represents the high pass filter. Each term signal [f_{d,t,0}, f_{d,t,1}, ..., f_{d,t,B-1}], where B is equal to the line count of document d since there is no division into blocks, is transformed into a sequence of two-component vectors

[f_{d,t,0}, f_{d,t,1}], [f_{d,t,2}, f_{d,t,3}], ..., [f_{d,t,2k-2}, f_{d,t,2k-1}]

if B is even (say, B = 2k). If B is odd, either the last value can be dropped or a zero can be padded to make it an even-length sequence. Right-multiplying each vector with the matrix H_2 gives the result

[s_0, d_0], [s_1, d_1], ..., [s_{k-1}, d_{k-1}]

where the s_i are the smooth coefficients and the d_i are the detail coefficients obtained after the first level of decomposition using the discrete Haar wavelet transform. Smooth coefficients are stored in the left part of the signal and detail coefficients in the right part. Further decomposition is done by applying the Haar scaling and wavelet filters repeatedly to the smooth coefficients.
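A minimal sketch of this decomposition, with zero-padding for odd-length signals as described (the multi-level loop here is a simplified version that stops when the smooth part has odd length):

```python
import math

def haar_level(signal):
    """One level of the normalized Haar transform: each pair (a, b) maps to
    a smooth coefficient (a+b)/sqrt(2) and a detail coefficient (a-b)/sqrt(2).
    Smooth coefficients go to the left half, details to the right half."""
    s = [(a + b) / math.sqrt(2) for a, b in zip(signal[0::2], signal[1::2])]
    d = [(a - b) / math.sqrt(2) for a, b in zip(signal[0::2], signal[1::2])]
    return s + d

def haar_dwt(signal, levels):
    """Multi-level decomposition: reapply the filters to the smooth part."""
    out = list(signal)
    if len(out) % 2:          # odd length: pad a zero
        out.append(0)
    n = len(out)
    for _ in range(levels):
        if n < 2 or n % 2:
            break
        out[:n] = haar_level(out[:n])
        n //= 2
    return out

# Term signal for a token occurring on lines 1 and 2 of a 4-line file:
spectrum = haar_dwt([1, 1, 0, 0], levels=1)
# -> [sqrt(2), 0.0, 0.0, 0.0]: smooth coefficients [sqrt(2), 0], zero details
```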
Consider plagiarized files HelloWorld.c and HelloWorld1.c.
HelloWorld.c:

#include<stdio.h>
main()
{
    printf("Hello World");
}

HelloWorld1.c:

#include<stdio.h>
main()
{
    char c = ' ';
    printf("Hello World");
}
Figure 7.5. Term Signals and their Wavelet Spectra after First Level of Decomposition using Normalized Haar Scaling and Wavelet Filters: a) Term Signals for char and include in Files HelloWorld.c and HelloWorld1.c Respectively b) Magnitude Components of the Term Spectra c) Phase Components of the Term Spectra
Figure 7.5 shows the term signals for char and include in files HelloWorld.c
and HelloWorld1.c and their corresponding wavelet spectra obtained after first level
of decomposition using normalized Haar scaling and wavelet filters.
7.3.4 Computing File Scores
On applying DWT to the weighted term signal patterns, term spectra are obtained. Each term spectrum has magnitude and phase information. Both the magnitude and the phase information of the term spectra are used to compute a score for each file, based on which the clustering of potentially plagiarized files is done.
The magnitude and phase information obtained from the wavelet spectra of the
term signals can be used in different ways to compute the document scores (Park et
al., 2004). Here, magnitude and phase are examined separately. A magnitude vector is
formed by adding the corresponding magnitude components for all the tokens in a
file. A phase precision vector is obtained by first assigning each phase to a unit
vector. The corresponding components of the vectors are added and the magnitude is
averaged to get the phase precision value. If all of the phases are the same, the unit
vectors will add constructively and the resulting magnitude will be 1. If the phases are
scattered, the unit vectors will add destructively and the resulting magnitude will be
close to zero (Park et al., 2005b).
The phase precision Φ_{d,b} of the b-th spectral component in document d is given by

Φ_{d,b} = | Σ_{t∈T} φ̄_{d,t,b} | / #T    (7.4)

where φ̄_{d,t,b} is the unit phase of the b-th spectral component of token t in document d and #T is the number of distinct tokens in document d. φ̄_{d,t,b} is given by

φ̄_{d,t,b} = ζ_{d,t,b} / |ζ_{d,t,b}| = e^{i φ_{d,t,b}}    (7.5)
A score vector is formed from the magnitude and phase precision vectors. The score s_{d,b} of the b-th spectral component in document d is given by

s_{d,b} = Φ_{d,b} Σ_{t∈T} H_{d,t,b}    (7.6)

A single score S_d for document d is obtained by adding the components of the score vector:

S_d = Σ_{b=0}^{B-1} s_{d,b}    (7.7)
For each file, scores are also calculated separately for the different sections
and these are compared against corresponding scores of other documents. Clustering
is done based on these two results.
The features of term signal patterns can directly be used to compute the file
scores based on which the files can be clustered. However, DWT helps to analyze the
signal patterns at different resolutions. This capability of DWT is utilized here and
hence, a more reliable score is computed using the transformed signal patterns.
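Treating the spectral components as complex numbers, the score computation of equations 7.4-7.7 can be sketched as follows. This is a simplified reading for a single whole-file score; the thesis additionally computes per-section scores:

```python
def file_score(spectra):
    """Combine magnitude and phase precision into a single file score S_d.
    spectra maps each token to its list of complex spectral components
    (all lists of equal length B)."""
    tokens = list(spectra)
    B = len(spectra[tokens[0]])
    score = 0.0
    for b in range(B):
        comps = [spectra[t][b] for t in tokens]
        # Phase precision: magnitude of the summed unit phase vectors,
        # averaged over tokens (1 if all phases agree, near 0 if scattered).
        units = [c / abs(c) for c in comps if abs(c) > 0]
        phase_prec = abs(sum(units)) / len(tokens) if units else 0.0
        magnitude = sum(abs(c) for c in comps)   # sum of H_{d,t,b}
        score += phase_prec * magnitude          # s_{d,b}
    return score                                 # S_d

# Two tokens with perfectly aligned phases in a single spectral component:
s = file_score({"VAR": [1 + 1j], "NUM": [2 + 2j]})
# phase precision 1; magnitudes sqrt(2) + 2*sqrt(2) -> score 3*sqrt(2)
```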
7.3.5 Processing of File Clusters in Phase 2
Phase 2 processes the file clusters obtained from phase 1. Here, the
comparison is done within the clusters rather than across the clusters based on the file
scores. ASTs are generated for files with similar scores or files identified as
potentially plagiarized in phase 1 and a preorder traversal is done to obtain the
corresponding node sequences. Sequence matching algorithms are then applied to
identify the plagiarized files. AST matching is used successfully to identify
plagiarized files with high precision and recall. It can figure out the portions of code
which are plagiarized.
7.4 IMPLEMENTATION AND TESTING
7.4.1 Clustering using DWT
In vector space analysis, each file is treated as a point in the n-dimensional
term space.
Consider two pairs of plagiarized Java files (from Database CEN – Java (Set
1)). DateandTime1.java and DateandTime2.java form a pair of plagiarized files.
sorting3.java and sorting4.java form another pair of plagiarized files. Considering the
number of occurrences of all the tokens in each file as given in table 7.2, each file is
represented as a discrete signal in figure 7.6.
Table 7.2 Term Frequencies for Four Java Files
Term      DateandTime1.java  DateandTime2.java  sorting3.java  sorting4.java
boolean           1                  0                1              1
catch             3                  3                0              0
class             1                  1                1              1
else              2                  0                2              2
FALSE             3                  2                1              1
for               0                  0               15             15
if                4                  0                7              7
import            5                  5                0              0
int               4                  3               11             11
new               7                  6                3              3
null              3                  3                0              0
public           10                  8                7              7
return            2                  0                0              0
static            9                  7                6              6
TRUE              1                  0                2              2
try               2                  2                0              0
void              8                  7                6              6
while             0                  0                4              4
Figure 7.6. Discrete Signal Representation for Two Pairs of Plagiarized Java
Files
In DWT based analysis, rather than treating each file as a signal, each file is
considered as a collection of term signals. Figure 7.7 shows the term signal patterns
for two pairs of plagiarized Java source code files.
Figure 7.7. Term Signal Patterns for Two Pairs of Plagiarized Java Files: a) DateandTime1.java b) DateandTime2.java c) sorting3.java d) sorting4.java
It can be noted from figure 7.7 that the term signal patterns for the plagiarized
files are identical. Figure 7.8 shows the term spectra obtained after four levels of
decomposition using Daubechies 4-tap scaling and wavelet filters.
Figure 7.8. Term Spectra Obtained after Four Levels of Decomposition using Daubechies 4-Tap Scaling and Wavelet Functions: a) DateandTime1.java b) DateandTime2.java c) sorting3.java d) sorting4.java
Figure 7.9 shows the clustering of plagiarized files: 1) DateandTime1.java 2)
DateandTime2.java 3) sorting3.java 4) sorting4.java. File scores can be computed in
different ways. Here, phase is also taken into consideration to calculate the score.
Figure 7.9. Clustering of Two Pairs of Plagiarized Java Files
Consider a set of 9 files consisting of 4 sets of plagiarized C files as given in
table 7.3. Grouping of these files based on the file scores obtained using equation 7.7
after first level decomposition using unnormalized Haar scaling and wavelet filters is
shown in figure 7.10.
Table 7.3 C Fileset 1 (Database CEN – C)
Set     File No.   File Name
Set 1      1       DerivativesThreePoint.c
           2       DerivativesThreePoint_T3.c
Set 2      3       amountFunction.c
           4       amountFunction_T3.c
Set 3      5       binToHex.c
           6       binToHex_T1.c
           7       binToHex_T3.c
Set 4      8       calculator.c
           9       calculator_T3.c
Figure 7.10. Plagiarized Files in C Fileset 1 Grouped into Clusters Based on File Scores Obtained after First Level Decomposition using Unnormalized Haar Scaling and Wavelet Filters
The plagiarized files have similar scores and thus can be grouped into distinct clusters. Using LSI, file-file similarity scores (for all possible pairs of files in the database) are obtained, which make it possible to identify the extent to which two given programs in the database are similar. DWT allows clusters of similar files to be formed. When there are a large number of files, a single cluster will contain files of varying functionality. However, the plagiarized files are grouped into the same cluster. A cluster may contain different sets of plagiarized files with similar file scores. It may also contain files of different functionality which are not plagiarized. These are identified only after a file-to-file comparison using AST matching.
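This grouping step can be sketched as clustering files whose scores lie close together. The greedy gap rule and the tolerance value here are assumptions for illustration:

```python
def cluster_by_score(scores, tol=0.05):
    """Group file names whose scores are within tol of each other.
    Greedy single-pass sketch: sort by score and start a new cluster
    whenever the gap to the previous file exceeds tol."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1])
    clusters, current = [], []
    prev = None
    for name, s in ordered:
        if prev is not None and s - prev > tol:
            clusters.append(current)
            current = []
        current.append(name)
        prev = s
    if current:
        clusters.append(current)
    return clusters

# Hypothetical file scores for two of the plagiarized pairs in table 7.3:
files = {"binToHex.c": 3.10, "binToHex_T3.c": 3.12,
         "calculator.c": 7.40, "calculator_T3.c": 7.43}
groups = cluster_by_score(files)
# -> [['binToHex.c', 'binToHex_T3.c'], ['calculator.c', 'calculator_T3.c']]
```

Files within each resulting cluster are then handed to phase 2 for pairwise AST matching.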
7.4.2 System Evaluation and Performance
A comparison of the running times of the combined approach employing DWT and AST and of the combined two-phase approach employing LSI and AST, when used to detect similarities in source code files written in C, is shown in figure 7.11.
Figure 7.11. Performance Plots of Combined LSI and AST and Combined DWT
and AST Approaches
The use of Daub-4 wavelet in phase 1 has given better results than the use of
Haar wavelet. It can be noted from figure 7.12 that the precision and recall on using
Daub-4 wavelet in phase 1 is better as compared to that of LSI or Haar wavelet.
Figure 7.12. Precision and Recall Curves Obtained on Applying LSI, Haar Wavelet Transform, and Daub-4 Wavelet Transform in Phase 1: a) Precision b) Recall
7.5 CONCLUSIONS
The two-phase architecture for source code plagiarism detection using DWT
and AST matching has proved to be more efficient than that using LSI and AST
matching. DWT is used only for an initial screening of files. Though more efficient
than LSA, the efficiency and performance of DWT based source code analysis highly
depends on the selection of wavelet and the level of decomposition.