Malware Analysis using Non-Signature based Method

Vinod P., V.Laxmi, M.S.Gaur, Grijesh Chauhan Department of Computer Engineering

Malaviya National Institute of Technology Jaipur, Rajasthan, India- 302 017

{vinodp, vlaxmi, gaurms}@mnit.ac.in, [email protected]

Abstract- In this paper we investigate a non-signature technique for malware detection and demonstrate feature selection methods suitable for feature reduction. Features in the form of instruction opcodes are extracted. Redundant features are eliminated using Scatter Criterion and Principal Component Analysis (PCA). The experiments are evaluated on a malware data set collected from VX Heavens and on benign executables (gathered from a fresh installation of the Windows XP operating system and other utility software). The experiments are performed on both packed and unpacked data sets. They also demonstrate that the proposed methods, which do not require signatures, are effective in identifying and classifying morphed malware.

Key words- malware, scatter criterion, principal component analysis, features, classifiers.

I. INTRODUCTION

Malware is a general term used for viruses, Trojans, rootkits, worms, adware, spyware etc. The goal of malicious software is to commit identity theft, consume system resources, allow unauthorized access to compromised systems etc. Many antivirus products make use of signatures for detecting malware. A signature is a unique byte pattern or string capable of identifying a code as malicious. Signature-based techniques have the following limitations: (a) they fail to detect encrypted code, (b) they lack semantic knowledge of the programs, (c) the signature repository grows at an alarming rate, (d) they fail to detect obfuscated malware and (e) signature generation requires human expertise and is a time-consuming process.

Our research targets Win32 Portable Executable (PE) files of both malware and benign samples. The choice of PE files was motivated by the frequencies of file types among the samples submitted to Virus Total [1]. We make use of machine learning methods for classifying the samples (benign or malware). Features in the form of instruction opcodes are extracted. The selected features are preprocessed using Scatter Criterion and Principal Component Analysis (PCA) to eliminate redundant features. A feature vector table is constructed for the malware and benign samples. The experiments have been evaluated using decision trees [2], instance-based learners [2] and an SVM classifier [2], as implemented in WEKA [3].

Some key contributions of this paper are (a) selection of a classifier suitable for classification of samples with low false positive/negative rates, (b) a disassembler module for detection and extraction of executable sections and instruction opcodes of Win32 PE files and (c) feature space reduction using Scatter Criterion and Principal Component Analysis to minimize training/testing overheads.

The remainder of the paper is organized as follows: In Section II we present an overview of related work in the domain of data mining methods. In Section III a brief outline of the proposed methodology is introduced. In Section IV and Section V the malware feature and its pre-processing are explained. Evaluation metrics are introduced in Section VI. The proposed methodologies for malware analysis using PCA and Scatter Criterion are explained in Section VII and Section VIII. Result analysis is covered in Section IX. Finally, concluding remarks and future work are discussed in Section X.

II. RELATED WORK

The authors in [4] proposed a "phylogeny" model using n-gram features to generate new sequences, called n-perms. A heuristics-based technique to detect unknown computer viruses was proposed in [5]. A non-signature based method using Self-Organizing Maps (SOMs) was proposed in [6] by Seon Yoo et al. Malicious code detection using text categorization and the imbalance problem was proposed by the authors in [7] [13].

The authors in [8] extracted n-grams from benign and malicious executables and used the k-nearest neighbour algorithm to identify unseen instances. Olivier Henchiri et al. [9] presented a method based on generic features applicable to different families of viruses. The classification accuracies of different classifiers were compared with that of the proposed classifier, which reported a higher detection rate. In [10], the authors extracted byte code features and relevant n-grams and evaluated them on various classifiers. The authors in [11] proposed a method to identify file types using 1-gram analysis of binary contents. A non-signature based method using byte-level file content was proposed in [12].

The authors in [14] proposed a new method for detecting variants of malware making use of opcode sequences. The relevant opcodes were mined and assigned certain weights. Malware analysis and detection using structural information of PE files was proposed by the authors in [15]. The analysis was performed on two test sets, one consisting of obfuscated malware and the other of clean malware samples. A detection rate of 92% was reported for obfuscated malware executables.

2011 International Conference on Network Communication and Computer (ICNCC 2011)

978-1-4244-9551-1/11/$26.00 2011 IEEE

III. BRIEF OUTLINE OF PROPOSED METHODOLOGY

Principal instruction opcodes of executables (malware/benign) are extracted using the methods discussed in Section IV. The features selected contain redundant features; hence, promising features are extracted by feature processing methods like Scatter Criterion (SC) and Principal Component Analysis (PCA) (refer Section V).

The classifiers are trained using malware (packed and obfuscated) and benign samples, and tested with two different test sets. Figure [1] depicts the different phases involved in the classification of samples (malware/benign).

IV. MALWARE FEATURE

In our proposed work, we have selected instruction opcodes as the feature. The following steps are adopted for selecting features: (a) identification of valid PE samples, (b) extraction of executable sections and (c) extraction of instruction opcodes from the raw data present in the executable sections.

Figure [2] and Figure [3] show the method employed in identifying a valid PE executable and its sections. The raw data obtained from the code section is used for extracting the opcodes. A typical x86 instruction contains instruction prefixes, opcode, ModR/M, SIB, displacement and immediate fields [20]. The following steps are used to extract the opcodes from the raw stream collected from the code section: (a) eliminate instruction prefixes from the raw stream of data obtained from the executable sections, (b) read the opcode and use it to consult the Opcode Decision table to find the appropriate ModR/M and (c) use the ModR/M byte for extracting the operands based on the different addressing modes.
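The check for a valid PE sample can be illustrated with a minimal sketch. It assumes the standard PE layout (an "MZ" DOS header whose 4-byte field at offset 0x3C points to the "PE\0\0" signature); the function name `is_valid_pe` is hypothetical, and the procedure in Figure 2 may perform additional checks.

```python
import struct


def is_valid_pe(data: bytes) -> bool:
    """Return True if the buffer carries the DOS and PE signatures."""
    # A DOS header is at least 0x40 bytes long and starts with "MZ".
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # e_lfanew (offset 0x3C, little-endian) holds the PE header offset.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    # An out-of-range offset yields a short slice, so the test fails safely.
    return data[e_lfanew:e_lfanew + 4] == b"PE\x00\x00"


# Hand-built minimal header used only for illustration.
header = bytearray(0x44)
header[0:2] = b"MZ"
struct.pack_into("<I", header, 0x3C, 0x40)
header[0x40:0x44] = b"PE\x00\x00"

print(is_valid_pe(bytes(header)))
print(is_valid_pe(b"not an executable"))
```

A real implementation would go on to parse the section table and keep only sections flagged as executable.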

Consider an .exe file Test.exe; the extraction of the instruction opcode is illustrated below. Table I shows a sample HEX dump for the file Test.exe.

The instruction opcode is extracted from the sample HEX dump (66 8138 4D 5A) as follows:

Instruction prefixes (66 and 67) are ignored.

81 is a one-byte opcode and 38 is the ModR/M byte; hence this information is preserved.

4D and 5A form the immediate operand (0x5a4d) and are not considered.

Thus, from the raw data 66 8138 4D 5A only 8138 is preserved.
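The worked example can be reproduced with a small sketch. The prefix set lists the legacy x86 prefixes, while `HAS_MODRM` is a hypothetical one-entry stand-in for the Opcode Decision table; a real decoder needs the full table, two-byte opcodes, SIB and displacement handling.

```python
# Legacy x86 instruction prefixes (operand/address size, lock, rep, segment).
PREFIXES = {0x66, 0x67, 0xF0, 0xF2, 0xF3, 0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65}

# Hypothetical, deliberately tiny decision table: opcodes followed by ModR/M.
HAS_MODRM = {0x81}


def extract_opcode(raw: bytes) -> str:
    """Return the opcode (plus ModR/M byte, if any) of one instruction as hex."""
    i = 0
    # Step (a): skip instruction prefixes.
    while i < len(raw) and raw[i] in PREFIXES:
        i += 1
    if i >= len(raw):
        raise ValueError("no opcode byte found")
    # Step (b): read the opcode and consult the decision table.
    kept = [raw[i]]
    if raw[i] in HAS_MODRM:
        if i + 1 >= len(raw):
            raise ValueError("truncated instruction")
        kept.append(raw[i + 1])
    # Immediate and displacement bytes are dropped.
    return "".join(f"{b:02X}" for b in kept)


print(extract_opcode(bytes.fromhex("6681384D5A")))  # prints 8138
```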

V. FEATURE PROCESSING

All attributes extracted from a sample might not convey useful information, and thus some attributes can be eliminated. Exclusion of unwanted attributes can reduce the extra processing time spent during the training and testing phases. In our proposed method, we have used feature reduction methods which are briefly explained in Section (V-A) and Section (V-B).

A. Scatter Criterion

Scatter criterion selects a feature based on the ratio of the mixture scatter to the within-class scatter. The mixture scatter is the sum of the within-class and between-class scatter. A high value of this ratio indicates the prominence of the feature for classification. The within-class scatter for any feature f is computed as

$$S_{wf} = \sum_{i=1}^{C} P_i S_{if}$$

where $S_{if}$ is the variance of feature f for a class $C_i$ (malware or benign), and $P_i$ is the prior probability of class $C_i$, with $P_i = 1/C$. The variance $S_{if}$ can be computed as follows:

$$S_{if} = \frac{1}{N} \sum_{j=1}^{N} (F_j - F_{if})^2$$

where N is the number of samples in class $C_i$, $F_j$ is the value of feature f in the j-th sample and $F_{if}$ is the class center (mean) of feature f for class $C_i$.

The between-class scatter $S_{bf}$ is the variance of the class centers with respect to the global center $F_f$, and is computed as:

$$S_{bf} = \sum_{i=1}^{C} P_i (F_{if} - F_f)^2$$

The mixture scatter is

$$S_{mf} = S_{wf} + S_{bf}$$

The Scatter Criterion for the f-th feature is thus

$$H_f = \frac{S_{mf}}{S_{wf}}$$

A large value of $H_f$ for a feature f is an indicator of the feature being more discriminant for classification.
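As an illustration, the scatter criterion for a single feature can be sketched in plain Python. The sketch assumes equal priors (1/C), population variance within each class, and a global center taken as the mean over all samples; the function name and the toy frequency values are hypothetical.

```python
from statistics import fmean


def scatter_criterion(values_by_class):
    """Ratio of mixture scatter to within-class scatter for one feature.

    values_by_class holds one list of feature values per class.
    """
    prior = 1.0 / len(values_by_class)  # equal priors, P_i = 1/C
    # Assumption: the global center is the mean over all samples.
    global_center = fmean(v for cls in values_by_class for v in cls)

    within = 0.0
    between = 0.0
    for cls in values_by_class:
        center = fmean(cls)
        variance = sum((v - center) ** 2 for v in cls) / len(cls)
        within += prior * variance
        between += prior * (center - global_center) ** 2

    if within == 0.0:
        return float("inf")  # classes have no spread at all
    return (within + between) / within


# Toy values: the first feature separates the classes, the second does not.
strong = scatter_criterion([[0.9, 1.1], [0.1, -0.1]])
weak = scatter_criterion([[0.4, 0.6], [0.5, 0.5]])
print(strong, weak)
```

With these toy values the separating feature scores about 26 and the overlapping one scores 1, the minimum possible value of the ratio.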

B. Principal Component Analysis

Principal Component Analysis groups attributes with similar information. The grouping of attributes is performed by measuring the correlation between them. A high correlation between any two attributes is an indicator that the attributes carry similar information. Thus, instead of two attributes, only a single attribute can be used to build the model.

Assessment of distinct groups is performed by monitoring the eigenvalues of the correlation matrix. Attributes having the same order of eigenvalues are grouped together, as they indicate the significance of the information. In other words, using PCA we estimate m principal components. This is achieved by finding the m eigenvectors with the m largest eigenvalues of the covariance matrix of the data set.
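The eigenvector step can be sketched without external libraries using power iteration with deflation. This is a swapped-in numerical method chosen for brevity, not necessarily the one used in the experiments; all function names and the toy data are hypothetical.

```python
def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length feature rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def top_eigenpairs(matrix, m, iterations=200):
    """Approximate the m largest eigenpairs of a symmetric matrix."""
    d = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(m):
        # Fixed, normalized start vector keeps the sketch deterministic.
        vec = [1.0 / (k + 1) for k in range(d)]
        norm = sum(x * x for x in vec) ** 0.5
        vec = [x / norm for x in vec]
        for _ in range(iterations):
            nxt = mat_vec(work, vec)
            norm = sum(x * x for x in nxt) ** 0.5
            if norm < 1e-12:  # nothing left to extract
                break
            vec = [x / norm for x in nxt]
        value = sum(v * w for v, w in zip(vec, mat_vec(work, vec)))
        pairs.append((value, vec))
        # Deflation: remove the component that was just found.
        work = [[work[a][b] - value * vec[a] * vec[b] for b in range(d)]
                for a in range(d)]
    return pairs


# Two perfectly correlated attributes collapse into one component.
pairs = top_eigenpairs(covariance_matrix([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]), 2)
for value, vector in pairs:
    print(round(value, 6), [round(x, 6) for x in vector])
```

In this toy case the first eigenvalue is close to 2 and the second close to 0, which mirrors the statement that two highly correlated attributes can be replaced by one.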

VI. EVALUATION METRICS

The performance of a classifier can be measured by checking the True Positive Rate (TPR) and True Negative Rate (TNR), also known as sensitivity and specificity. The following evaluation metrics are used, where malware is the positive class and TP, FN, TN and FP denote the numbers of true positives, false negatives, true negatives and false positives:

True Positive Rate (TPR): The proportion of actual positives (malware samples) correctly classified as positives, i.e. TPR = TP/(TP + FN)

False Positive Rate (FPR): The proportion of benign samples incorrectly classified as malicious. This is also called the false alarm rate or fall-out. FPR = FP/(FP + TN)


True Negative Rate (TNR): The proportion of negatives (benign samples) correctly identified as negatives. TNR = TN/(TN + FP)

False Negative Rate (FNR): The proportion of positives (malware samples) incorrectly classified as benign. FNR = FN/(FN + TP)

A protection system requires high values of TPR and TNR along with low values of FPR and FNR. This ascertains that the scanner is capable of correctly identifying samples (malware or benign).
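The four rates follow directly from the confusion-matrix counts. The sketch below uses hypothetical counts chosen only to make the arithmetic easy to follow; they are not results from the experiments.

```python
def rates(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute TPR, FNR, TNR and FPR from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),  # malware correctly flagged
        "FNR": fn / (fn + tp),  # malware missed
        "TNR": tn / (tn + fp),  # benign correctly passed
        "FPR": fp / (fp + tn),  # benign wrongly flagged
    }


# Hypothetical counts: 100 malware and 100 benign test samples.
result = rates(tp=96, fn=4, tn=90, fp=10)
print(result)
```

Note that TPR + FNR = 1 and TNR + FPR = 1, so reporting one rate of each pair determines the other.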

TABLE I. SAMPLE HEX DUMP FOR A FILE TEST.EXE

ADDRESS    HEX DUMP         INSTRUCTION
00401313   66 8138 4D 5A    cmp word [eax], 0x5a4d

Figure 1. Proposed model for classifying malware and benign samples

Figure 2. Feature Selection: Detecting a Valid PE

Figure 3. Feature Selection: Extraction of bytes from executable sections

VII. PROPOSED METHOD FOR MALWARE ANALYSIS USING PRINCIPAL COMPONENT ANALYSIS

The term principal in this context refers to those instruction opcodes obtained using PCA (as part of feature reduction). Principal opcodes are the discriminant features of a class (malware or benign). The steps adopted in our proposed method are: (a) extract opcodes appearing with high frequency in the samples (malware/benign), (b) compute the eigenvalues and eigenvectors, (c) arrange the eigenvectors in decreasing order of their eigenvalues, (d) select relevant opcodes by fixing a lower threshold value; these form the feature space, (e) create feature vectors for the malware and benign samples, which are used to train the classifiers and (f) test the models with unknown samples which were not used during training, in order to validate the classification accuracy.
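Step (e) can be sketched as follows. The sketch assumes that each feature value is the relative frequency of a selected opcode within a sample; the text does not state the exact encoding, and the opcode strings and function name are hypothetical.

```python
from collections import Counter


def feature_vector(opcodes, selected):
    """Relative frequency of each selected opcode in one sample."""
    total = len(opcodes)
    if total == 0:
        return [0.0] * len(selected)
    counts = Counter(opcodes)
    return [counts[op] / total for op in selected]


# Hypothetical feature space and one hypothetical sample.
selected = ["8138", "8BEC", "FF15"]
sample = ["8138", "8BEC", "8BEC", "33C0"]

vector = feature_vector(sample, selected)
print(vector)  # prints [0.25, 0.5, 0.0]
```

One such row per sample, together with its class label, would form the feature vector table passed to the classifiers.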

A. Experimental Setup

Our experiments were performed on a dataset consisting of 4546 Win32 PE executables. This dataset contains 1129 viruses downloaded from VX Heavens [16]. We obtained 1085 benign programs from the System32 folder of a fresh installation of the Windows XP operating system, Cygwin utilities, games, Internet browsers, media players and other sources. The experiments were carried out using WEKA [3] for training and testing. The training set consisted of 2214 samples (malware = 1129 and benign = 1085). 10-fold cross validation was applied.


Further, two test sets were created: (a) Test set1, consisting of 439 obfuscated malware and 443 benign samples, and (b) Test set2, consisting of 596 packed malware and 854 benign samples. The test sets were kept separate, and none of their samples were included during training. The experiments were performed using the SMO, IBK, AdaBoost1 (with J-48 as base classifier), J-48 and Random Forest (RF) algorithms implemented in WEKA. The results are evaluated using the metrics defined in Section VI. For each experiment, different features are used for preparing the feature vectors, as listed below:

Opcodes which are predominant (91 principal opcodes) in malware (1129 samples) but occur with lower frequency in benign samples (refer Table II).

Opcodes which are predominant (35 principal opcodes) in benign (1085 samples) but occur with lower frequency in malware samples (refer Table III).

Opcodes present in malware (18 principal opcodes) but absent in benign samples (M-B) (refer Table V).

Opcodes present in benign (124 principal opcodes) but absent in malware samples (B-M) (refer Table IV).
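The M-B and B-M feature sets are plain set differences over the opcodes observed in each class. A minimal sketch, with hypothetical opcode strings and a hypothetical function name:

```python
def discriminant_opcodes(malware_samples, benign_samples):
    """Return (M-B, B-M): opcodes seen in only one of the two classes."""
    malware_ops = set().union(*malware_samples)
    benign_ops = set().union(*benign_samples)
    return sorted(malware_ops - benign_ops), sorted(benign_ops - malware_ops)


# Each sample is the list of opcodes extracted from one executable.
malware = [["8138", "CD21"], ["CD21", "8BEC"]]
benign = [["8BEC", "FF15"], ["8138", "FF15"]]

m_minus_b, b_minus_m = discriminant_opcodes(malware, benign)
print(m_minus_b, b_minus_m)  # prints ['CD21'] ['FF15']
```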

VIII. EXTRACTION OF PRINCIPAL INSTRUCTION OPCODES USING SCATTER CRITERION

Scatter Criterion (refer Section V-A) is used to eliminate redundant instruction opcodes from benign and malware executables. For the malware data set (consisting of 1129 executables), a total of 193 distinct instruction opcodes were extracted. Using the scatter criterion, this feature space (193 unique instruction opcodes) was reduced to 37 opcodes based on the values of Hf. Likewise, for the benign samples (576 instruction opcodes), the feature space was reduced to 57 prominent opcodes. Classifiers were trained using 2214 executables (malware/benign) and tested using Test set1 (439 obfuscated malware, 443 benign samples) and Test set2 (596 packed malware, 854 benign samples). The experiments were evaluated using the evaluation metrics of Section VI. Table VI and Table VII show the classifier accuracies for features gathered using Scatter Criterion for malware and benign samples respectively.

IX. RESULT ANALYSIS

We have proposed malware analysis using machine learning methods, which is capable of detecting unseen malware. Prominent opcodes were extracted using Principal Component Analysis and Scatter Criterion. From our experiments, we observe that improved classification accuracy with low false alarms is obtained. Some important observations made from the experiments are: (a) the detection accuracy of the proposed method (using PCA) is reasonable, as shown by a high TPR = 0.96 and a low FNR = 0.04 for Test set1, (b) the proposed detection method works well for both packed and obfuscated malware samples, (c) the proposed prototype also performs well in classifying unseen benign samples, (d) classifiers like AdaBoost1 and Random Forest perform better at classifying malware (packed and obfuscated) and benign samples, mainly because of their boosting and bagging properties and (e) comparing the features obtained using Principal Component Analysis and Scatter Criterion, we observe that better classification accuracies are obtained with the features extracted using PCA.

X. CONCLUSIONS AND FUTURE WORK

In this paper, we have proposed a non-signature based detection technique capable of identifying obfuscated and packed malware samples. Detection is performed by extracting principal instruction opcodes. The experiments were performed using two different test sets, one consisting of obfuscated malware and the other of packed malware executables. The results are validated using evaluation metrics computed for various classifiers. The principal opcodes extracted using our proposed method gave classification accuracies of TPR = 0.961 and TNR = 0.90. The results obtained were superior to those of the earlier method proposed by the authors in [15], which reported 92.5% for obfuscated samples.

In future work on our detection system, hybrid features such as mnemonic n-grams, instruction opcodes, APIs and some PE structural information will be explored.

ACKNOWLEDGEMENT

The authors are grateful to the Ministry of Communication and Information Technology, Government of India, for supporting and funding this project.

REFERENCES

[1] Virus Total. http://www.virustotal.com/stats.html.

[2] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, October 1999.

[3] Open source Machine Learning Software Weka. http://www.cs.waikato.ac.nz/ml/weka/.

[4] Karim, Md. and Walenstein, Andrew and Lakhotia, Arun and Parida, Laxmi, Malware phylogeny generation using permutations of code, Journal in Computer Virology, Springer Paris, pp. 13-23, vol. 1, 2005.

[5] Schultz, M.G.; Eskin, E.; Zadok, F.; Stolfo, S.J., "Data Mining Methods for Detection of New Malicious Executables," Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium on, pp. 38-49, 2001.

[6] Yoo, In and Ultes-Nitsche, Ulrich, Non-Signature based Virus Detection, Journal in Computer Virology, Springer Paris, pp. 163-186, vol. 2, issue 3, 2006.

[7] Moskovitch, R.; Stopel, D.; Feher, C.; Nissim, N.; Elovici, Y., "Unknown Malcode Detection via Text Categorization and the Imbalance Problem," IEEE International Conference on Intelligence and Security Informatics, 2008 (ISI 08), pp. 156-161, 17-20 June 2008.

[8] J.O.Kephart and B.Arnold. N-grams-Based File Signatures For Malware Detection. pages 178-184.

[9] Henchiri, Olivier and Japkowicz, Nathalie, "A Feature Selection and Evaluation Scheme for Computer Virus Detection", In Proceedings of the Sixth International Conference on Data Mining (ICDM '06), 2006, pp. 891-895, IEEE Computer Society, Washington, DC, USA.

[10] Kolter, J. Zico and Maloof, Marcus A., Learning to Detect and Classify Malicious Executables in the Wild, J. Mach. Learn. Res., vol. 7, December, 2006, pp. 2721-2744.

[11] Wei-Jen Li, Ke Wang, Stolfo, S.J., Herzog, B., "Fileprints: identifying file types by n-gram analysis," Information Assurance Workshop, 2005. IAW '05. Proceedings from the Sixth Annual IEEE SMC, pp. 64-71, 15-17 June 2005.

[12] Tabish, S. Momina and Shafiq, M. Zubair and Farooq, Muddassar, Malware detection using statistical analysis of byte-level file content, In Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics (CSI-KDD '09), 2009, pp. 23-31.

[13] Andrew Walenstein, Michael Venable, Matthew Hayes, Christopher Thompson and Arun Lakhotia: Exploiting similarity between variants to defeat malware: Vilo method for comparing and searching binary programs, In: Proceedings of BlackHat DC 2007 (2007), https://blackhat.com/presentations/bh-dc-07/Walenstein/Paper/bh-dc-07-walenstein-WP.pdf.

[14] Igor Santos, Felix Brezo, Javier Nieves, Yoseba Penya, Borja Sanz, Carlos Laorden, and Pablo Bringas. Idea: Opcode Sequence-based malware detection, Engineering Secure Software and Systems, volume 5965 of Lecture Notes in Computer Science, pages 35-43. Springer Berlin / Heidelberg, 2010.

[15] Ronny Merkel, Tobias Hoppe, Christian Kraetzer, and Jana Dittmann. Statistical Detection of Malicious PE-Executables for Fast Offline Analysis. In Bart De Decker and Ingrid Schaumüller-Bichl, editors, Communications and Multimedia Security, volume 6109 of Lecture Notes in Computer Science, pages 93-105. Springer Berlin / Heidelberg, 2010.

[16] VX Heavens. http://vx.netlux.org/lib.

[17] Intel 64 and IA-32 Architectures Software Developer's Manual. http://www.intel.com/products/processor/manuals/index.htm.

TABLE II. VALUES OF TPR, TNR, FPR, FNR (TEST SET1 AND TEST SET2) FOR PREDOMINANT OPCODES PRESENT IN MALWARE SAMPLES (FEATURE LENGTH = 91 PRINCIPAL OPCODES)

TABLE III. VALUES OF TPR, TNR, FPR, FNR (TEST SET1 AND TEST SET2) FOR PREDOMINANT OPCODES PRESENT IN BENIGN SAMPLES (FEATURE LENGTH = 35 PRINCIPAL OPCODES)

TABLE IV. CLASSIFIER ACCURACIES USING DISCRIMINANT OPCODES OF BENIGN SAMPLES (TEST SET1 AND TEST SET2)

Classifiers   Test Set1                       Test Set2
              TPR    FNR    TNR    FPR        TPR    FNR    TNR    FPR
SMO           0.949  0.05   0.559  0.44       0.974  0.025  0.503  0.496
IBK           0.904  0.095  0.884  0.115      0.954  0.045  0.844  0.155
AdaBoost1     0.89   0.109  0.92   0.079      0.932  0.067  0.878  0.121
J48           0.9    0.1    0.871  0.128      0.914  0.085  0.8    0.199
RF            0.9    0.1    0.912  0.088      0.966  0.033  0.86   0.139

TABLE V. VALUES OF TPR, TNR, FPR, FNR (TEST SET1 AND TEST SET2) OBTAINED FOR OPCODES PRESENT IN MALWARE AND ABSENT IN BENIGN SAMPLES

Classifiers   Test Set1                       Test Set2
              TPR    FNR    TNR    FPR        TPR    FNR    TNR    FPR
AdaBoost1     0.829  0.17   0.887  0.112      0.892  0.107  0.779  0.22
J48           0.808  0.191  0.882  0.117      0.887  0.112  0.806  0.193
RF            0.842  0.157  0.857  0.142      0.917  0.082  0.797  0.202

TABLE VI. VALUES OF TPR, TNR, FPR, FNR (TEST SET1 AND TEST SET2) FOR PROMINENT OPCODES OF MALWARE USING SCATTER CRITERION

Classifiers   Test Set1                       Test Set2
              TPR    FNR    TNR    FPR        TPR    FNR    TNR    FPR
AdaBoost1     0.922  0.077  0.894  0.101      0.956  0.043  0.854  0.145
J48           0.927  0.072  0.893  0.106      0.953  0.046  0.833  0.166
RF            0.938  0.061  0.891  0.108      0.976  0.023  0.84   0.159

TABLE VII. VALUES OF TPR, TNR, FPR, FNR (TEST SET1 AND TEST SET2) FOR PROMINENT OPCODES OF BENIGN EXECUTABLES USING SCATTER CRITERION

Classifiers   Test Set1                       Test Set2
              TPR    FNR    TNR    FPR        TPR    FNR    TNR    FPR
AdaBoost1     0.908  0.091  0.905  0.094      0.959  0.04   0.867  0.132
J48           0.89   0.109  0.909  0.09       0.922  0.077  0.855  0.144
RF            0.927  0.072  0.911  0.088      0.961  0.038  0.852  0.147



Classifiers   Test Set1                       Test Set2
              TPR    FNR    TNR    FPR        TPR    FNR    TNR    FPR
AdaBoost1     0.917  0.091  0.925  0.074      0.958  0.041  0.854  0.145
J48           0.924  0.079  0.902  0.088      0.827  0.172  0.809  0.19
RF            0.961  0.041  0.898  0.101      0.956  0.043  0.823  0.176

Classifiers   Test Set1                       Test Set2
              TPR    FNR    TNR    FPR        TPR    FNR    TNR    FPR
AdaBoost1     0.917  0.082  0.925  0.074      0.949  0.05   0.823  0.176
J48           0.922  0.077  0.902  0.088      0.939  0.06   0.843  0.156
RF            0.961  0.038  0.898  0.101      0.958  0.041  0.838  0.161
