
School of Computer and Information Science

CIS Research Placement Report

User identification using n-gram analysis on command line histories

Sunee Holland

Date: 19/07/10

Supervisor: AsPr Helen Ashman

Abstract

A method for identifying authorship using n-gram analysis is presented. Custom software has been built to evaluate the effectiveness of this method. The n-gram analysis technique is compared against one similar, commonly used method for authorship identification: command frequency analysis. It was found that n-gram analysis can successfully identify authorship. Compared to command frequency analysis, it also reduces the chance of false positive readings and makes the results easier to interpret. Values for n were explored, and it is shown that a higher n produces easily identifiable similarities between histories from the same author.

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Research Questions
  1.3 Scope
2 Literature Review
  2.1 N-gram based analysis
  2.2 Anomaly detection using command line data
3 Methodology
  3.1 Implementation
    3.1.1 Software Architecture
    3.1.2 History Analysis Algorithms
    3.1.3 Application Inputs and Outputs
  3.2 User Study
4 Results
  4.1 Command Frequency Analysis
  4.2 N-gram Analysis
    4.2.1 N-gram values for n
5 Future Work
6 Conclusion
7 References
Appendix A
Appendix B

1 Introduction

N-grams are overlapping, n-sized segments of a series of letters, words, syllables, phonemes or base pairs. An n-gram analysis can be conducted to establish a view of the frequency of grams in a given sequence. The research presented focuses on using n-grams to track the frequency of single characters or groups of characters. The innovative contribution is the application of n-gram frequency analysis to command line histories for identifying authorship. N-gram frequencies can be used in the context of behavioural intrusion detection systems (IDS) to determine user authorship through a percentage comparison of text, command histories in particular.

N-grams have a variety of applications, such as malicious executable detection (Kolter, JZ & Maloof, MA 2004; Abou-Assaleh, T, Cercone, N, Kešelj, V & Sweidan, R 2004) and language classification (Cavnar, WB & Trenkle, JM 1994). Current literature in n-gram analysis extends early ideas based on text categorisation (Cavnar, WB & Trenkle, JM 1994) and authorship (Houvardas, J & Stamatatos, E 2006; Peng, F, Schuurmans, D, Wang, S & Keselj, V 2003; Soboroff, IM, Nicholas, CK, Kukla, JM & Ebert, DS 1997). Generally, these methods take a text input, on which an n-gram analysis is performed. Starting at the beginning of the sequence, an n-sized segment is inspected, the frequency of that gram is stored, and the inspection window slides one position to the right. For example, a character n-gram analysis where n=2 and the word is “regret” yields { “re”, ”eg”, ”gr”, ”re”, ”et” }; the frequency of each gram is “re”=2, “eg”=1, “gr”=1 and “et”=1. The result is an overall view of how frequent n-sized segments are, and this information can be used to characterise a particular user.

This work on n-grams is intended to be used in the context of a behavioural IDS to characterise a user based on their behaviour. Behavioural IDS are concerned with analysing user behavioural patterns for the purpose of detecting anomalies (Balajinath, B & Raghavan, SV 2001). The ability to characterise authorship based on behavioural patterns is pivotal to the research presented in this report. This can be achieved by performing an n-gram analysis, an approach currently absent from the literature, rather than by observing command frequencies. N-gram analysis has the advantage of finer granularity and therefore increased accuracy. Additionally, grams are variable in length and therefore provide increased flexibility.

Custom software is created to demonstrate a method that combines n-gram analysis with authorship identification, which could in future lead to anomaly detection in an IDS.

1.1 Motivation

The primary motivation for authorship identification using n-gram analysis on command line histories is the potential for increased identification accuracy. Previous behavioural IDS models have adopted command frequency analysis for the purpose of authorship identification (Lane, T & Brodley, CE 1999; Balajinath, B & Raghavan, SV 2001; Maxion, RA 2003). This method is appropriate; however, it is worth exploring whether n-gram analysis could improve the accuracy of authorship identification. N-gram analysis takes the order of the data into account, and this ordering information should produce a more distinctive characteristic, allowing individuals to be identified more accurately. The value of placing n-grams in this context is further strengthened when they are combined with other characteristic features such as keystroke dynamics in a behavioural IDS, which could be explored in future work.

Current literature lacks the combination of n-gram analysis on command line histories within a behavioural IDS. There has been one application of n-grams in behavioural IDS, where they were used to analyse application data rather than to behaviourally identify users (Kolter, JZ & Maloof, MA 2004). Additionally, there are many behavioural IDS that use command frequency analysis on command line histories, but none that use n-gram analysis (Tan, K 1995; Lane, T & Brodley, CE 1999; Balajinath, B & Raghavan, SV 2001; Maxion, RA 2003). Owing to its absence in this context, n-gram analysis has the potential to contribute to a behavioural IDS.

1.2 Research Questions

The research contribution is the use of n-grams to characterise users by analysing command histories. One high level question sums up the overall research problem in a generalised fashion:

What are the problems faced when investigating the use of n-grams to tabulate the gram frequency of command histories?

The following questions crystallise the major issues that are addressed during the course of the research.

Is it possible to uniquely identify users using n-gram analysis on command histories?

What are the important differences between n-gram analysis and command frequency analysis for the purpose of identifying authorship?

What are suitable values of n for achieving accurate results in the context of n-gram analysis?

1.3 Scope

The research presented has a number of facets that will not be considered. Firstly, n-gram analysis is implemented on command line histories outside a behavioural IDS environment. It is important to test the proof of concept: that n-gram analysis can increase the accuracy of authorship identification. However, the purpose of developing such a solution lies in its role within a behavioural IDS environment, so that integration is left to future work. Additionally, in a behavioural IDS the n-gram analysis could be combined with other characterisation methods such as keystroke dynamics to improve the uniqueness of authorship identification, but this is not covered here. Lastly, n-gram analysis is applied to command line histories only. N-gram analysis may have many other applications for authorship or application data analysis; however, they are not covered in this research.


2 Literature Review

The following contains a review of literature relevant to n-gram analysis and authorship identification. It forms the basis for understanding how the presented research innovates from these established ideas. Using n-gram analysis on command line histories is shown to be absent from the literature and has the potential for increased detection accuracy in a behavioural IDS. Command frequency analyses are commonly used for the purpose of authorship identification. However, one improvement n-gram analysis has over this method is its finer granularity of inspection, and hence its ability to better characterise authors. Additionally, there exist various implementations of behavioural IDS that use command frequency analysis for the purpose of authorship identification, but none that use n-gram analysis. One application of n-grams in behavioural IDS was used to analyse application data but not to identify users (Kolter, JZ & Maloof, MA 2004). The following sections describe the literature on n-gram based analysis and on anomaly detection using command line data.

2.1 N-gram based analysis

An early method of n-gram analysis had the primary goal of classifying text based on n-gram frequency statistics (Cavnar, WB & Trenkle, JM 1994). Compared to word frequencies, n-grams inspected the text more closely and tolerated textual errors better. Text based categorisation is useful for identifying the language of a text, and was tested by being applied to Usenet newsgroup articles written in different languages. It was also successfully tested for the ability to classify according to the subject of a text, demonstrated by applying it to computer-oriented newsgroup articles. This research formed the basis for classification using n-gram frequency analysis in a variety of contexts.

Another early method of n-gram analysis was used to cluster documents based on usage patterns (Soboroff, IM, Nicholas, CK, Kukla, JM & Ebert, DS 1997). Latent Semantic Indexing, an automatic topic based indexing of documents, was used for the purpose of determining the writing style of documents. Its use coincided with n-gram analysis on patterns of usage to determine the authorship of a document. The work also looked at finding an optimal value of n for the length of grams, and reported that a smaller n is better at capturing style characteristics, as whole words occur more often. The main contribution was the combination of n-gram analysis and Latent Semantic Indexing to classify the author of a particular document.

Research continued to grow in the area of authorship using n-grams. The aim of one particular application was the ability to predict the probability of naturally occurring word sequences in any language using n-gram frequencies (Peng, F, Schuurmans, D, Wang, S & Keselj, V 2003). This was used to categorise texts based on similarity. In the experimental study, performance was rated by measuring overall accuracy: the number of correctly classified texts divided by the total number of texts classified. Similar to previous research, the authors explored optimal values of n based on these accuracy ratings. They found that a small n does not capture sufficient information, while a large n creates sparse data readings. Additionally, optimal n values were dependent on context: an n of 3 for Greek and Chinese documents and an n of 6 for English documents gave the highest accuracy ratings. As a result of testing, their simple n-gram authorship algorithm was shown to improve on their previous text categorisation research.

The authors of another n-gram paper observed that viruses were typically only discovered after they had caused damage and made themselves apparent. A solution was devised to reduce collateral damage by providing a detailed characterisation algorithm that automatically detects malicious code on the fly using byte level n-gram analysis (Abou-Assaleh, T, Cercone, N, Kešelj, V & Sweidan, R 2004). To validate the usefulness of the algorithm, an experiment was conducted that measured the rate of correct categorisation for malicious versus benign code. Sample data included Windows executables extracted from email messages. The test also showed the accuracy for n with values ranging from 1 to 10. It concluded that this n-gram method produced very high accuracy results, particularly when n was 3 or over.

A system called the Malicious Executable Classification System was produced for commercial purposes; it detected malicious executables using n-grams with byte codes as features (Kolter, JZ & Maloof, MA 2004). A combination of information retrieval and text classification was used to achieve malicious executable detection. A pilot study aimed to determine the size of n-grams, the size of words and the number of selected features. Using this information, an experiment was then performed on small and large collections of executables, comparing true positive rates to false positive rates to compare the performance of different algorithms; the authors found their application to be successful.

In more recent times, n-gram analysis continues to be used for authorship identification. Researchers found value in automatically comparing variable length n-grams to select appropriate features for authorship identification (Houvardas, J & Stamatatos, E 2006). They tested this concept by using 3-grams, 4-grams and 5-grams and comparing them to each other. The basis of the comparison was to find the most frequent grams over a range of variable n values and use them to determine authorship. The results supported the observation that this new method improved on other feature selection methods at the time.

None of these methods are applied to command line histories. Some were used to characterise authorship on other types of user data, such as documents and malicious code. Therefore, there is an opening for n-gram analysis on command histories for authorship identification.

2.2 Anomaly detection using command line data

Early research used neural networks to characterise user behaviour for the detection of security violations (Tan, K 1995). Security violations were found by detecting anomalous user behaviour through patterns in characteristics such as user activity times, user login hosts, user foreign hosts, command set and CPU usage. The novel contribution was the ability to adapt to changing patterns through neural networks that quickly learn new patterns and slowly forget them. It was a notably useful method of intrusion detection that is context tolerant, performs less processing, and consumes less memory and disk space.

A later method of anomaly detection focused on storage reduction when creating a user profile. The aim was to characterise behaviour using temporal sequences of discrete data (Lane, T & Brodley, CE 1999). An experiment was conducted in which UNIX shell commands were analysed and instance-based learning was used to generate profiles. The system was evaluated by measuring the accuracy of detection and the mean time to generate an alarm. The results showed a good success rate, measured by the identification of malicious occurrences, with a very low false positive rate.

One intrusion detection method uses a real time genetic learning algorithm to develop a behaviour model for users (Balajinath, B & Raghavan, SV 2001). The system takes command line data as input, referred to as user behaviour genes, and a fitness function determines whether an anomaly has occurred. An experiment showed that this method was effective at characterising user behaviour.

Another intrusion detection method deals with masquerade attacks, in which one user pretends to be another. It compares a novel method using information enriched command line data against truncated command line data for masquerade detection (Maxion, RA 2003). Truncated data includes only the command used, such as “cd”. Enriched command line data includes entire commands, such as “cd <folder name>”. Test data included command line histories from UNIX users. The enriched method increased hits, reduced misses, increased false alarms and reduced the equal-basis cost of error.

These methods show that command line data has been used for detecting anomalies in behavioural IDS. None of them applied n-gram analysis to achieve authorship identification or anomaly detection. As a result, there is ample opportunity to combine n-gram analysis with authorship identification and behavioural IDS.


3 Methodology

The following chapter outlines the steps taken to design and build the software and how it is evaluated. The tangible outcomes are described, including the architecture, algorithms and inputs/outputs of the system.

3.1 Implementation

The implementation section describes the logical flow of the software produced. The algorithms used, command frequency analysis and n-gram analysis, are explained in detail. The inputs the application takes and the outputs it produces are also explored.

3.1.1 Software Architecture

The following is an overview of the application flow. The application is written in Java. There are two classes: the main “Ngram.java” and the user defined data type “Data.java”. On a Java ready machine with the class files in the current directory, the application runs with the command “java Ngram [n]”, where n is the command line argument for the size of the segment. It then finds all command histories, which are files within that directory that start with “history-“ and end with “.txt”. Next, it reads and parses each file, replacing line breaks with spaces and removing any line numbers in the file. Each formatted input string is tokenised into two separate HashMaps: one for file name association and one that points to the user defined data type Data, which contains the frequency of the commands as well as the file name and the command line argument value of n. CSV (Comma Separated Values) files are prepared for shared frequency n-gram analysis, n-gram comparison, shared frequency command analysis and command frequency comparison, which are described in more detail in subsection 3.1.3. They are placed in the “csv” folder, can be read by Microsoft Excel or equivalent, and are ready for the analysis explicated in chapter 4. Appendix A contains the Java code that performs this.

3.1.2 History Analysis Algorithms

A brief description of command frequency analysis is provided for contrast, which should help form an understanding of the operation of n-grams. This section then describes the logical flow of character based n-gram analysis. For the purposes of the developed application, and therefore this research, characters are selected as the feature for n-grams, meaning the value of n refers to the number of characters within a gram.


3.1.2.1 Command Frequency

Command frequency analysis is the tabulation of the frequency of commands, in this context performed on command histories. It has been implemented in the custom software to provide a comparison with n-gram analysis, which is used and later described in chapter 4. The algorithm operates as follows. The first step takes an input, usually text. This text is tokenised by words, and in this way it is similar to word based n-gram analysis without the variable length. The application checks a table that stores the frequency of each command. If the current command matches a record in the table, the frequency is incremented. If the current command has not been recorded before, a new record is created and the frequency is set to 1. It then looks at the next command in the sequence. This process repeats until there are no commands left. A minimal sketch of the tabulation loop is given below. Appendix B contains more information on word n-gram analysis, complete with examples, which follows very similar principles.
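This sketch (class and method names hypothetical; the Appendix A implementation differs in details) assumes the history has already been normalised to a single whitespace separated string:

import java.util.HashMap;

public class CommandFreqDemo {
    // Tabulate how often each whitespace separated token occurs.
    static HashMap<String, Integer> commandFrequencies(String history) {
        HashMap<String, Integer> freq = new HashMap<String, Integer>();
        for (String token : history.trim().split("\\s+")) {
            Integer count = freq.get(token);
            freq.put(token, count == null ? 1 : count + 1); // new record or increment
        }
        return freq;
    }

    public static void main(String[] args) {
        // "ls" occurs twice, "cd" and "pwd" once each
        System.out.println(commandFrequencies("ls cd ls pwd"));
    }
}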

3.1.2.2 Character N-grams

N-gram analysis is the tabulation of the frequency of variable length grams and, like command frequency analysis, is performed on command histories. The following is an example based description of how a character n-gram analysis algorithm operates. The first step takes an input, usually text, such as “a_part”. The algorithm takes the value of n and uses it to inspect the first n-sized character segment: for n=1 this is “a”, and for n=2 it is “a_”. The application checks a table that stores the frequency of grams. If the current gram matches a record in the table, the frequency of that gram is incremented. If the current gram has not been recorded before, a new record is created and the frequency is set to 1. The window of inspection then shifts one position to the right for the next iteration: for n=1 this gives “_“, and for n=2 it gives “_p”. This process repeats until there are no n-sized grams left. The entire resulting set of substrings for n=1 is “a”, “_“, “p”, “a”, “r”, “t”, and for n=2 it is “a_“, “_p”, “pa”, “ar”, “rt”. Example 1 below shows the flow of each iteration and the values stored in the frequency table for an n-gram analysis where n=1 and the input is “a_part”. Example 2 shows the same except with n=2, and a minimal code sketch follows the two examples.

Example 1: Logical flow of a character 1-gram

For the string “a_part”:

1st iteration: “a_part” -> frequency of the character “a” + 1
2nd iteration: “a_part” -> frequency of the character “_“ + 1
3rd iteration: “a_part” -> frequency of the character “p” + 1
4th iteration: “a_part” -> frequency of the character “a” + 1
5th iteration: “a_part” -> frequency of the character “r” + 1
6th iteration: “a_part” -> frequency of the character “t” + 1

Frequency table:

Gram   Frequency
“a”    2
“_“    1
“p”    1
“r”    1
“t”    1

Table 1—Frequency table for a character 1-gram for the string “a_part”.

Example 2: Logical flow of a character 2-gram

For the string “a_part”:

1st iteration: “a_part” -> frequency of the gram “a_” + 1
2nd iteration: “a_part” -> frequency of the gram “_p“ + 1
3rd iteration: “a_part” -> frequency of the gram “pa” + 1
4th iteration: “a_part” -> frequency of the gram “ar” + 1
5th iteration: “a_part” -> frequency of the gram “rt” + 1

Frequency table:

Gram   Frequency
“a_”   1
“_p“   1
“pa”   1
“ar”   1
“rt”   1

Table 2—Frequency table for a character 2-gram for the string “a_part”.
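A minimal sketch of this sliding window tabulation (class name hypothetical; Appendix A contains the implementation actually used) reproduces Tables 1 and 2:

import java.util.HashMap;

public class CharNgramDemo {
    // Slide an n-character window over the text and count each gram.
    static HashMap<String, Integer> ngrams(String text, int n) {
        HashMap<String, Integer> freq = new HashMap<String, Integer>();
        for (int i = 0; i + n <= text.length(); i++) {
            String gram = text.substring(i, i + n);
            Integer count = freq.get(gram);
            freq.put(gram, count == null ? 1 : count + 1);
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("a_part", 1)); // {a=2, _=1, p=1, r=1, t=1} (order may vary)
        System.out.println(ngrams("a_part", 2)); // {a_=1, _p=1, pa=1, ar=1, rt=1} (order may vary)
    }
}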

3.1.3 Application Inputs and Outputs

Inputs

Input to the application includes a command history file containing commands from a single user over an arbitrary period of time. The application finds all files that start with “history-“ and end with “.txt”. All line breaks are replaced with spaces, and all line numbers, which are the first 4 characters in a line, are replaced with spaces. This ensures the input data is uniform and not tainted by irrelevant data. The other input the application takes is the value of n, which is the length per gram used in inspecting a command history file. Optimal values for n are discussed in section 4.2.1.

Outputs

The application writes the results of the frequency analyses to CSV files. Each file contains column titles and row titles followed by the values for that particular row. They can be interpreted by Microsoft Excel or equivalent and displayed in table format. The outputs of the application are:

N-gram frequency - Shows the frequency of each character gram within a command history. File name format: <historyname>.ngram<n>.csv

Command frequency - Shows the frequency of each command within a command history. File name format: <historyname>.cmdfreq.csv

Shared frequency n-gram analysis - Shows a comparison between two command histories, listing all the shared character grams and their frequencies side by side. File name format: <history1name>.ngram<n>_vs_<history2name>.ngram<n>.csv

N-gram table - Shows the percentage similarity of all the command histories compared to each other using character n-gram analysis, produced for each <n>. File name format: ngram<n>table.csv

Shared frequency command analysis - Shows a comparison between two command histories, listing all the shared commands and their frequencies side by side. File name format: <history1name>.cmdfreq_vs_<history2name>.cmdfreq.csv

Command frequency table - Shows the percentage similarity of all the command histories compared to each other using command frequency analysis. File name format: cmdfreqtable.csv
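For illustration (the gram strings and frequencies here are hypothetical), the first lines of a shared frequency n-gram file, as written by the Appendix A code, look like this; the leading ="" wrapper is an Excel device that preserves whitespace inside grams:

"n-gram","history-P03a.txt","history-P03b.txt"
"=""cd ""","12","9"
"=""ls ""","7","11"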

3.2 User Study

Five participants were involved in the study. Each was asked to extract and provide the command history file of their nominated command line shell supporting the command history feature. For example, this may be done in a BASH shell using the command “history > filename”, which redirects the output into a file. The files were of arbitrary length depending on the time frame captured, which has a minimal impact on the results; however, it is assumed that more data can create a more accurate characterisation. These files were stripped of unnecessary information, such as line numbering and surrounding whitespace, to ensure only user behaviour is being analysed. Some users were able to provide multiple command histories as a result of having multiple computers they use, such as a laptop. The files were then renamed to start with “history-“ and end in “.txt”, with the participant ID in the middle, and analysed using the custom software.


4 Results

The outputs produced by the application, as described in section 3.1.3, are the precursors for the results described in this chapter. These results intend to show, firstly, that n-gram analysis can successfully identify authors. A comparison of n-gram analysis and command frequency analysis is shown to outline a more distinct rate of authorship similarity or difference. Secondly, these results indicate whether or not there is a value or values of n that produce better results. This chapter begins with the results of command frequency analysis, including an overall percentage similarity between participants and individual comparisons of highly similar profiles to verify that they are likely to be from the same author. Then n-gram analysis is described, which likewise shows an overall percentage similarity between participants and individual comparisons between participants. The chapter ends with a discussion of optimal values for n.

4.1 Command Frequency Analysis

The analysis process requires the outputs produced by the application, as these indicate how well individual authors are characterised. The application produces a list of commands from the command frequency analysis entitled “<historyname>.cmdfreq.csv”. These files contain each command together with the number of times that command appears in that particular command history. These files are collated into a table of overall similarity percentages for all the command histories compared to each other. The similarity is the percentage of commands that are shared between two histories. This is stored in the file “cmdfreqtable.csv” and its output is shown in the table below.

Command frequency table
(columns, left to right, follow the same order as the rows)

history-P01a.txt  100       12.72727  10.90909  10.90909  14.54545  16.36364  14.54545
history-P01b.txt  8.433735  100       10.84337  14.45783  19.27711  19.27711  8.433735
history-P02.txt   4.195804  6.293706  100       6.993007  9.79021   10.48951  4.195804
history-P03a.txt  2.419355  4.83871   4.032258  100       76.20968  6.451613  3.629032
history-P03b.txt  2.339181  4.678363  4.093567  55.26316  100       6.140351  3.216374
history-P04.txt   5.732484  10.19108  9.55414   10.19108  13.3758   100       5.732484
history-P05.txt   14.28571  12.5      10.71429  16.07143  19.64286  16.07143  100


Table 3—Table for similarity percentages for all histories compared to each other using command frequency analysis.

From this table, we can see that every history compared to itself has 100% similarity. Most other comparisons show no discernible difference from one another. One exception is the percentage similarity between histories P03a and P03b, which is abnormally high. This is explained by the fact that they are histories from the same author (P03). This is evidence that command frequency analysis can identify authors, as already established in the literature.
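These percentages are produced by the sharedKeys routine in Appendix A: the number of keys (commands here; grams in section 4.2) that the row history shares with the column history, divided by the row history's number of distinct keys, times 100. A condensed sketch (the wrapper class is added for illustration):

import java.util.HashMap;

public class SimilarityDemo {
    // Condensed from sharedKeys in Appendix A: percentage of the first
    // map's distinct keys that also occur in the second map.
    static double sharedKeys(HashMap<String, Integer> i, HashMap<String, Integer> j) {
        int count = 0;
        for (String key : i.keySet())
            if (j.containsKey(key)) count++;
        return 100.0 * count / i.size();
    }
}

Because the denominator is the row history's key count, the measure is asymmetric, which is why the table above is not symmetric (P03a vs P03b is 76.2 while P03b vs P03a is 55.3).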

Histories from P01 (P01a and P01b) have a low similarity percentage, which is the opposite of what was expected. The type of usage for that particular participant was explained to be quite different: each command history was produced on a different computer for different purposes, so the low similarity percentage makes sense.

One other point to note from the table above is that some comparisons have high similarity percentages despite a lack of discernible similarity. The problem lies in being able to accurately interpret high similarity percentages, and as a result the chance of false positives increases. This is primarily caused by command frequency analysis looking only at commands and ignoring their ordering. N-gram analysis aims to solve this by analysing same sized blocks and considering the order of the data more stringently.

An additional part of the analysis involves a further comparison of histories with high similarity percentages. This takes place because the similarity percentage is based only on shared commands rather than on similar frequencies of commands; a high similarity percentage alone therefore cannot identify authorship. The application produces additional CSV files for a side by side comparison of the shared commands for each history versus each other. Files are named “<history1name>.cmdfreq_vs_<history2name>.cmdfreq.csv”.

This was done with the command histories “history-P03a.txt” and “history-P03b.txt”, stored in the file “history-P03a.txt.cmdfreq_vs_history-P03b.txt.cmdfreq.csv”. It was opened in Microsoft Excel and a bar graph was created with unique commands on the x axis and frequency on the y axis. The colours for the graph were chosen to enhance visibility: the black background outlines the foreground colours, “history-P03a.txt” is shown in dark blue and “history-P03b.txt” in grey. These colours are transparent and overlapping, so command history similarity shows as the combination of the two colours, light blue. The graph of the P03a versus P03b data is shown below.

[Figure: bar chart “P03a vs P03b command frequency analysis”; x axis: Command, y axis: Frequency.]

Figure 1—Comparison of similar grams for P03a vs. P03b using command frequency analysis.

From this graph, one can see that the data is quite sparse, but there is a similarity between the two frequency profiles, shown by the light blue. Since the frequencies of shared commands overlap, it suggests that the two histories were probably produced by the same author.

This verifies the finding in current literature that command frequency analysis is successful at characterising authorship, but there is potential for improvement, as detailed in the following section.

4.2 N-gram Analysis

The start of the n-gram analysis process is similar to the start of the command frequency analysis. The application lists the frequency of n-grams produced by the n-gram analysis in an output file called “<historyname>.ngram<n>.csv”. These files are produced for each <n> chosen through the command line arguments; for the purpose of this experiment, the values of n are 1, 2, 3, 4, 5, 7, 10 and 15. Every file contains each gram with the number of times that gram appears in that particular command history. These files are collated into a table of overall similarity percentages for all the command histories compared to each other. The similarity is the percentage of grams that are shared between two histories. There is a separate table for every n, stored in the file(s) “ngram<n>table.csv”. An example output is shown for a 15-gram analysis, which was chosen as it gave discernible results.

15-gram frequency table
(columns, left to right, follow the same order as the rows)

history-P01a.txt  100       0.533333  0.533333  0         0         1.2       0
history-P01b.txt  0.1999    100       0.649675  0.09995   0.09995   0.349825  0
history-P02.txt   0.105932  0.34428   100       0.026483  0.370763  1.006356  0
history-P03a.txt  0         0.03581   0.017905  100       66.05192  0.07162   0
history-P03b.txt  0         0.022452  0.157162  41.41221  100       0.157162  0
history-P04.txt   0.261248  0.203193  1.103048  0.11611   0.406386  100       0
history-P05.txt   0         0         0         0         0         0         100

Table 4—Table for similarity percentages for all histories compared to each other using 15-gram analysis.

Again, every history compared to itself has 100% similarity. The same author comparison (P03a and P03b) is now very distinct, and dissimilar comparisons have very low percentages of similarity. N-gram analysis considers the ordering of commands and additionally processes the entire document rather than a single line at a time. Not only can n-gram analysis uniquely identify authors, it is shown to reduce the chance of false positive readings.

As with command frequency analysis, a further comparison of histories with high similarity percentages is required. The application produces additional CSV files for a side by side comparison of the shared grams for each history versus each other, for each n. Files are named “<history1name>.ngram<n>_vs_<history2name>.ngram<n>.csv”. This was done with the same command histories “history-P03a.txt” and “history-P03b.txt”, stored in the file “history-P03a.txt.ngram15_vs_history-P03b.txt.ngram15.csv”. The 15-gram analysis was chosen for consistency with the table above. It was opened in Microsoft Excel and a bar graph was created with unique commands on the x axis and frequency on the y axis. The colours are the same as in Figure 1: a black background outlining the foreground colours, “history-P03a.txt” in dark blue and “history-P03b.txt” in grey, with overlap shown in light blue. The graph of the P03a versus P03b data is shown below.

[Figure: bar chart “P03a vs P03b 15-gram analysis”; x axis: Command, y axis: Frequency.]

Figure 2—Comparison of similar grams for P03a vs. P03b using 15-gram analysis.

Compared to the command frequency graph in Figure 1, the data is no longer sparse. The similarity between the two frequency profiles is even more noticeable with n-gram analysis. Again, the frequencies of shared grams overlap, suggesting that the two histories were probably produced by the same author. The following graph shows the same method comparing two dissimilar command histories (P02 versus P04). It is offered as a contrast to Figure 2, which shows two same author command histories.


[Figure: bar chart “P02 vs P04 15-gram analysis”; x axis: Commands, y axis: Frequency.]

Figure 3—Comparison of similar grams for P02 vs. P04 using 15-gram analysis.

Figure 3 outlines a clear difference between a different author comparison, as above, and a same author comparison, as in Figure 2. Shared commands are sparser and fewer in number, and more dark blue and grey bars are evident compared to Figure 2. These graphs side by side indicate that n-gram analysis can successfully identify authorship.

It is hence shown that n-gram analysis can be used instead of command frequency analysis in methods that identify authorship. They show similar results; however, n-gram analysis has the possibility of reducing false positive readings by providing more accurate and clearer results. N-gram analysis also has the benefit of variable length grams, which can be tailored to suit a particular purpose, as described in the following section.

4.2.1 N-gram values for n

N-gram analysis inspects n-sized grams while processing text. Variable length grams are beneficial as they allow the algorithm to be tuned to more accurately identify users. N-gram analysis was performed with the values 1, 2, 3, 4, 5, 7, 10 and 15; however, tables are provided only for the values 1, 3, 5, 10 and 15 to avoid needless repetition. Each is compared below, starting with the 1-gram frequency table.


1-gram frequency table
(columns, left to right, follow the same order as the rows)

history-P01a.txt  100       71.92982  94.73684  91.22807  98.24561  89.47368  80.70175
history-P01b.txt  78.84615  100       96.15385  96.15385  100       90.38462  82.69231
history-P02.txt   77.14286  71.42857  100       87.14286  95.71429  88.57143  77.14286
history-P03a.txt  77.61194  74.62687  91.04478  100       100       86.56716  79.10448
history-P03b.txt  71.79487  66.66667  85.89744  85.89744  100       83.33333  74.35897
history-P04.txt   68        62.66667  82.66667  77.33333  86.66667  100       74.66667
history-P05.txt   76.66667  71.66667  90        88.33333  96.66667  93.33333  100

Table 5—Table for similarity percentages for all histories compared to each other using 1-gram analysis.

As expected, all the comparisons have quite high similarity percentages, which reduces the ability to differentiate authors; there is a high chance of false positive readings. The following table shows a 3-gram analysis.

3-gram frequency table
(columns, left to right, follow the same order as the rows)

history-P01a.txt  100       19.67593  17.82407  26.62037  32.63889  23.37963  17.36111
history-P01b.txt  12.72455  100       20.20958  32.78443  40.26946  24.8503   12.72455
history-P02.txt   7.077206  12.40809  100       24.08088  31.61765  22.33456  8.180147
history-P03a.txt  7.871321  14.98973  17.93292  100       86.51608  18.61739  10.74606
history-P03b.txt  7.186544  13.7105   17.53313  64.42406  100       18.24669  10.34659
history-P04.txt   10.28513  16.90428  24.74542  27.69857  36.45621  100       10.28513
history-P05.txt   17.77251  20.14218  21.09005  37.20379  48.10427  23.93365  100

Table 6—Table for similarity percentages for all histories compared to each other using 3-gram analysis.

The same author comparison (P03a and P03b) is starting to stand out, but as with command frequency analysis there is still a relatively high chance of false positive readings. The following table shows a 5-gram analysis.

5-gram frequency table
(columns, left to right, follow the same order as the rows)

history-P01a.txt  100       7.21831   5.809859  6.514085  7.570423  7.922535  6.338028
history-P01b.txt  4.120603  100       4.723618  9.849246  13.46734  7.839196  5.125628
history-P02.txt   1.98915   2.833032  100       6.871609  10.72936  8.438819  2.049427
history-P03a.txt  1.45612   3.856749  4.486423  100       79.53562  4.565132  4.28965
history-P03b.txt  1.182293  3.684355  4.894144  55.56778  100       5.389057  3.876822
history-P04.txt   2.853519  4.9461    8.877616  7.355739  12.42866  100       3.360812
history-P05.txt   6.196213  8.777969  5.851979  18.76076  24.2685   9.122203  100

Table 7—Table for similarity percentages for all histories compared to each other using 5-gram analysis.

The 5-gram analysis shows improvement over the 3-gram and 1-gram analyses, but still has the same problem as command frequency analysis: some comparisons have higher similarity percentages than expected, so there is still a chance of false positives. The following table shows a 10-gram analysis.

10-gram frequency table
(columns, left to right, follow the same order as the rows)

history-P01a.txt  100       2.832861  1.699717  0.424929  0.141643  3.399433  0.991501
history-P01b.txt  1.244555  100       1.1201    1.369011  1.493466  2.240199  0.746733
history-P02.txt   0.42523   0.637845  100       0.496102  1.311127  2.445074  0.070872
history-P03a.txt  0.068043  0.498979  0.317532  100       70.65094  0.635065  0.181447
history-P03b.txt  0.014747  0.35393   0.545642  45.93718  100       0.943814  0.221206
history-P04.txt   0.889878  1.334816  2.558398  1.038191  2.373007  100       0.667408
history-P05.txt   0.832342  1.426873  0.237812  0.951249  1.783591  2.140309  100

Table 8—Table for similarity percentages for all histories compared to each other using 10-gram analysis.

The 10-gram analysis shows a vast improvement over the command frequency analysis results; there is a low chance of false positive readings. The following table shows a 15-gram analysis.

15-gram frequency table
(columns, left to right, follow the same order as the rows)

history-P01a.txt  100       0.533333  0.533333  0         0         1.2       0
history-P01b.txt  0.1999    100       0.649675  0.09995   0.09995   0.349825  0
history-P02.txt   0.105932  0.34428   100       0.026483  0.370763  1.006356  0
history-P03a.txt  0         0.03581   0.017905  100       66.05192  0.07162   0
history-P03b.txt  0         0.022452  0.157162  41.41221  100       0.157162  0
history-P04.txt   0.261248  0.203193  1.103048  0.11611   0.406386  100       0
history-P05.txt   0         0         0         0         0         0         100

Table 9—Table for similarity percentages for all histories compared to each other using 15-gram analysis.

The 15-gram analysis now shows unique authors definitively. Strong similarity is required before two histories are matched, and dissimilar comparisons have low percentages, as they should. The false positive rate should be decreased significantly.

A trend is seen each time the value of n increases: significant similarities between command histories become clearer, which reduces false positive readings. While this data suggests that a higher n will produce better results, it must be noted that an n that exceeds the command history length will reduce accuracy, as noted in previous literature (Peng, F, Schuurmans, D, Wang, S & Keselj, V 2003). The variable length n allows n-gram analysis to be flexible and customisable, and is therefore an additional improvement.


5 Future Work

There are opportunities for continuing this work in ways that have not yet been explored. The main future application is integration with a behavioural IDS. Additionally, a live analysis of incoming keystrokes would be an interesting and useful endeavour. One flaw in the current design is that the similarity is based on shared grams and does not take into account the frequency of grams. If the initial analysis process considered the frequency of grams, it could remove the extra steps for comparing shared grams. Furthermore, optimisation of the application code would be useful, as the analysis process takes time to complete, and running an analysis on a larger number of command histories at a time would further exacerbate the problem. Other facets that could be explored include whitespace trimming, considering case sensitivity, and algorithm optimisation for better results. A sketch of a frequency-aware comparison appears below.
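As one possible formulation of the frequency-aware comparison suggested above (an assumption of this report's editor-level sketch, not part of the current implementation), cosine similarity over the gram frequency maps would reward agreeing frequencies rather than merely shared grams:

import java.util.HashMap;
import java.util.Map;

public class CosineDemo {
    // Hypothetical frequency-aware similarity: cosine similarity between
    // two gram-frequency maps, in [0, 1]. Unlike sharedKeys, agreeing
    // frequencies raise the score, not just shared keys.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            double va = e.getValue();
            normA += va * va;
            Integer vb = b.get(e.getKey());
            if (vb != null) dot += va * vb;
        }
        for (int v : b.values()) normB += (double) v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Integer> a = new HashMap<String, Integer>();
        Map<String, Integer> b = new HashMap<String, Integer>();
        a.put("cd ", 10); a.put("ls ", 5);
        b.put("cd ", 8);  b.put("vi ", 3);
        System.out.println(cosine(a, b)); // ~0.838: the shared "cd " dominates
    }
}

A score of 1 means identical relative frequencies and 0 means no shared grams, which would fold the side by side comparison step into the similarity table itself.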


6 Conclusion

Empirical data has shown that n-gram analysis can produce more accurate and more easily recognisable results than command frequency analysis for authorship identification. An experiment was conducted using custom software to analyse command line histories with both n-gram analysis and command frequency analysis. Both methods tabulated the frequency of every command or n-gram. The results were used to form tables of the similarity percentages for all histories compared to each other, which in turn were used to gauge the effectiveness of command frequency analysis versus n-gram analysis. N-gram analysis is also shown to have the advantage of flexibility due to the variable gram length. Generally, increasing the gram length makes same author histories more identifiable and reduces the chance of false positive readings. N-gram analysis could be provided as an alternative to command frequency analysis in a behavioural IDS, as the two have closely similar functionality; however, that is yet to be explored.


7 References

Abou-Assaleh, T, Cercone, N, Kešelj, V & Sweidan, R 2004, 'N-Gram-Based Detection of New Malicious Code'.

Balajinath, B & Raghavan, SV 2001, 'Intrusion detection through learning behavior model', Computer Communications, vol. 24, no. 12, pp. 1202-1212.

Cavnar, WB & Trenkle, JM 1994, 'N-Gram-Based Text Categorization', Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, New York, USA.

Houvardas, J & Stamatatos, E 2006, 'N-Gram Feature Selection for Authorship Identification', 12th International Conference on Artificial Intelligence: Methodology, Systems, and Applications, Varna, Bulgaria.

Kolter, JZ & Maloof, MA 2004, 'Learning to detect malicious executables in the wild', Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, Seattle, WA, USA.

Lane, T & Brodley, CE 1999, 'Temporal sequence learning and data reduction for anomaly detection', ACM Trans. Inf. Syst. Secur., vol. 2, no. 3, pp. 295-331.

Maxion, RA 2003, 'Masquerade detection using enriched command lines', Dependable Systems and Networks, 2003, Proceedings.

Peng, F, Schuurmans, D, Wang, S & Keselj, V 2003, 'Language independent authorship attribution using character level language models', Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1, Budapest, Hungary.

Soboroff, IM, Nicholas, CK, Kukla, JM & Ebert, DS 1997, 'Visualizing document authorship using n-grams and latent semantic indexing', Proceedings of the 1997 workshop on New paradigms in information visualization and manipulation, Las Vegas, Nevada, United States.

Tan, K 1995, 'The application of neural networks to UNIX computer security', Neural Networks, 1995, Proceedings, IEEE International Conference.


Appendix A

Data.java

import java.util.*;

// Holds the analysis results for one history file: the n-gram and
// command frequency maps, the file name, and the value of n used.
public class Data {
    public String name;
    public HashMap<String, Integer> ngrams;
    public HashMap<String, Integer> commandFreq;
    public int ngram;

    public Data(String name, HashMap<String, Integer> ngrams,
                HashMap<String, Integer> commandFreq, int ngram) {
        this.name = name;
        this.ngrams = ngrams;
        this.commandFreq = commandFreq;
        this.ngram = ngram;
    }
}

Ngram.java

import java.io.*;
import java.util.*;

public class Ngram {

    private static final boolean DEBUG = false;

    private static ArrayList<Data> data = new ArrayList<Data>();

    public static void main(String[] args) {
        int n = 0;
        // Exactly one argument is expected: the n-gram size.
        if (args.length == 1) {
            try {
                n = Integer.parseInt(args[0]);
            }
            catch (NumberFormatException nfe) {
                System.out.println("Could not parse n because it is not an Integer.");
                return; // abort, since a valid n is required
            }

            // Find all history files: history-*.txt
            FilenameFilter filenameFilter = new FilenameFilter() {
                public boolean accept(File dir, String name) {
                    return name.endsWith(".txt") && name.startsWith("history-");
                }
            };
            File currDir = new File(".");
            String[] filenames = currDir.list(filenameFilter);

            // Make sure the csv output directory exists before writing into it.
            new File("csv").mkdirs();

            // Populate n-grams and command frequencies for each history file.
            for (int i = 0; i < filenames.length; i++) {
                String fileName = filenames[i];
                String fileContents = readFile(fileName);

                // Original contents (line breaks become spaces).
                String orig = fileContents.replaceAll("\n", " ");
                orig = orig.replaceAll("\r", "");

                // Strip the line numbers that the shell prepends to history output.
                String noLn = orig.replaceAll("[ ]+[0-9]{1,4}[ ]{2}", " ");

                HashMap<String, Integer> ngramMap = ngram(noLn, n, fileName);
                HashMap<String, Integer> commandFreqMap = commandFrequencies(noLn, fileName);
                data.add(new Data(fileName, ngramMap, commandFreqMap, n));
            }

            // N-gram table: pairwise percentages of shared n-grams.
            String analysisCsv = "\"\",";
            for (int i = 0; i < data.size(); i++) {
                analysisCsv += "\"" + data.get(i).name + "\"";
                if (i != data.size() - 1)
                    analysisCsv += ",";
            }
            analysisCsv += "\r\n";
            for (int i = 0; i < data.size(); i++) {
                analysisCsv += "\"" + data.get(i).name + "\",";
                for (int j = 0; j < data.size(); j++) {
                    analysisCsv += "\"" + sharedKeys(data.get(i).ngrams, data.get(j).ngrams) + "\"";
                    sharedFreqNgramAnalysis(data.get(i), data.get(j));
                    if (j != data.size() - 1)
                        analysisCsv += ",";
                }
                analysisCsv += "\r\n";
            }
            writeFile("csv" + "/" + "ngram" + n + "table.csv", analysisCsv);

            // Command frequency table: pairwise percentages of shared commands.
            analysisCsv = "\"\",";
            for (int i = 0; i < data.size(); i++) {
                analysisCsv += "\"" + data.get(i).name + "\"";
                if (i != data.size() - 1)
                    analysisCsv += ",";
            }
            analysisCsv += "\r\n";
            for (int i = 0; i < data.size(); i++) {
                analysisCsv += "\"" + data.get(i).name + "\",";
                for (int j = 0; j < data.size(); j++) {
                    analysisCsv += "\"" + sharedKeys(data.get(i).commandFreq, data.get(j).commandFreq) + "\"";
                    sharedFreqCmdAnalysis(data.get(i), data.get(j));
                    if (j != data.size() - 1)
                        analysisCsv += ",";
                }
                analysisCsv += "\r\n";
            }
            writeFile("csv" + "/" + "cmdfreqtable.csv", analysisCsv);
        }
        else {
            System.out.println("Usage: java Ngram [n]");
        }
    }

    // Returns the percentage of keys in map i that also appear in map j.
    // Note that this measure is asymmetric: it is normalised by i's size.
    private static double sharedKeys(HashMap<String, Integer> i, HashMap<String, Integer> j) {
        int count = 0;
        for (String key : i.keySet()) {
            if (j.containsKey(key)) {
                count++;
            }
        }
        return (((double)count / (double)i.size()) * (double)100);
    }

    // Writes a CSV listing every n-gram shared by histories i and j, with its
    // frequency in each. The ="..." wrapping forces spreadsheet software to
    // treat each gram as literal text.
    private static void sharedFreqNgramAnalysis(Data i, Data j) {
        String csv = "\"n-gram\",\"" + i.name + "\",\"" + j.name + "\"\r\n";
        for (String key : i.ngrams.keySet()) {
            if (j.ngrams.containsKey(key)) {
                String keyParsed = key.replaceAll("\"", "\"\"");
                keyParsed = "=\"\"" + keyParsed + "\"\"";
                csv += "\"" + keyParsed + "\",\"" + i.ngrams.get(key) + "\",\"" + j.ngrams.get(key) + "\"\r\n";
            }
        }
        writeFile("csv" + "/" + i.name + ".ngram" + i.ngram + "_vs_" + j.name + ".ngram" + j.ngram + ".csv", csv);
    }

    // As above, but for the commands/arguments shared by histories i and j.
    private static void sharedFreqCmdAnalysis(Data i, Data j) {
        String csv = "\"Command/Argument\",\"" + i.name + "\",\"" + j.name + "\"\r\n";
        for (String key : i.commandFreq.keySet()) {
            if (j.commandFreq.containsKey(key)) {
                String keyParsed = key.replaceAll("\"", "\"\"");
                keyParsed = "=\"\"" + keyParsed + "\"\"";
                csv += "\"" + keyParsed + "\",\"" + i.commandFreq.get(key) + "\",\"" + j.commandFreq.get(key) + "\"\r\n";
            }
        }
        writeFile("csv" + "/" + i.name + ".cmdfreq" + "_vs_" + j.name + ".cmdfreq.csv", csv);
    }

    // Tokenises the history and counts how often each command/argument occurs;
    // the counts are also written to a per-file CSV.
    private static HashMap<String, Integer> commandFrequencies(String fileContents, String fileName) {
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        String csv = "\"Command/Argument\",\"Frequency\"\r\n";
        // Split into tokens on runs of whitespace (the first alternative is
        // intended to match simple quoted strings).
        String[] rawSplit = fileContents.split("(\"[(^\")\\s]*\")|([\\s]+)");
        for (int i = 0; i < rawSplit.length; i++) {
            if (!map.containsKey(rawSplit[i])) {
                map.put(rawSplit[i], 1);
            }
            else {
                int count = map.get(rawSplit[i]);
                map.put(rawSplit[i], ++count);
            }
        }
        for (String key : map.keySet()) {
            if (DEBUG)
                System.out.println(key + "\t" + map.get(key));
            String keyParsed = key.replaceAll("\"", "\"\"");
            keyParsed = "=\"\"" + keyParsed + "\"\"";
            csv += "\"" + keyParsed + "\",\"" + map.get(key) + "\"\r\n";
        }
        writeFile("csv" + "/" + fileName + ".cmdfreq.csv", csv);
        return map;
    }

    // Slides an n-character window across the history one character at a time
    // and counts how often each character n-gram occurs; the counts are also
    // written to a per-file CSV.
    private static HashMap<String, Integer> ngram(String fileContents, int n, String fileName) {
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        String csv = "\"n-gram\",\"Frequency\"\r\n";
        for (int i = 0; i <= fileContents.length() - n; i++) {
            String gram = fileContents.substring(i, i + n);
            if (!map.containsKey(gram)) {
                map.put(gram, 1);
            }
            else {
                int count = map.get(gram);
                map.put(gram, ++count);
            }
        }
        for (String key : map.keySet()) {
            if (DEBUG)
                System.out.println(key + "\t" + map.get(key));
            String keyParsed = key.replaceAll("\"", "\"\"");
            keyParsed = "=\"\"" + keyParsed + "\"\"";
            csv += "\"" + keyParsed + "\",\"" + map.get(key) + "\"\r\n";
        }
        writeFile("csv" + "/" + fileName + ".ngram" + n + ".csv", csv);
        return map;
    }

    // Reads the whole file into a single string, joining lines with '\n'.
    private static String readFile(String fileName) {
        FileReader fr = null;
        BufferedReader br = null;
        String s = "";
        try {
            fr = new FileReader(fileName);
            br = new BufferedReader(fr);
            String line = null;
            while ((line = br.readLine()) != null) {
                s += line + "\n";
            }
        }
        catch (IOException ioe) {
            System.out.println("Unable to read file: " + fileName);
        }
        finally {
            try {
                if (br != null)
                    br.close();
                if (fr != null)
                    fr.close();
            }
            catch (IOException ioe) { }
        }
        return s;
    }

    // Writes fileContents to fileName, overwriting any existing file.
    private static void writeFile(String fileName, String fileContents) {
        FileWriter fw = null;
        try {
            fw = new FileWriter(fileName);
            fw.write(fileContents);
        }
        catch (IOException ioe) {
            System.out.println("Unable to write file: " + fileName);
        }
        finally {
            try {
                if (fw != null)
                    fw.close();
            }
            catch (IOException ioe) { }
        }
    }
}
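The program is run with a single argument giving the n-gram size, as the usage message in main indicates. A typical session (assuming the history-*.txt files and a csv output directory sit alongside the compiled classes) might look like:

javac Data.java Ngram.java
java Ngram 3

This reads every history-*.txt file in the current directory and writes, under csv/, a per-file n-gram listing and command frequency listing, a pairwise comparison file for each pair of histories, and the two summary tables (here ngram3table.csv and cmdfreqtable.csv).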


Appendix B

The following is an example-based description of how a word n-gram analysis algorithm operates; a short code sketch after Examples 1 and 2 illustrates the same procedure. The first step takes an input, usually text, such as “a part of distance”. The algorithm then uses the value of n to inspect the first n-word segment: for n=1 this is “a”, and for n=2 it is “a part”. The application checks a table that stores the frequency of each gram. If the current gram matches a record in the table, the frequency of that gram is incremented; if the current gram has not been recorded before, a new record is created and its frequency is set to 1. The window of inspection then shifts one position to the right for the next iteration: for n=1 this gives “part”, and for n=2 it gives “part of”. This process repeats until there are no n-sized grams left. The complete resulting set of grams for n=1 is “a”, “part”, “of”, “distance”, and for n=2 it is “a part”, “part of”, “of distance”.

Example 1: Logical flow of a word 1-gram

For the string “a part”

1st iteration: “a part” -> frequency of the word “a” + 1
2nd iteration: “a part” -> frequency of the word “part” + 1

Frequency table:

Gram     Frequency
“a”      1
“part”   1

Example 2: Logical flow of a word 2-gram

For the string “a part”

1st iteration: “a part” -> frequency of the word pair “a part” + 1

Frequency table:

Gram      Frequency
“a part”  1
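The procedure above can be written compactly in Java. The sketch below is illustrative only; it is not part of the analysis software in Appendix A (which operates on character n-grams), and the class and method names are our own. It counts word n-grams by sliding an n-word window across the input one word at a time:

import java.util.*;

public class WordNgramExample {

    // Count word n-grams by sliding an n-word window across the input,
    // moving one word at a time, as described above.
    public static Map<String, Integer> wordNgrams(String text, int n) {
        String[] words = text.trim().split("\\s+");
        Map<String, Integer> freq = new HashMap<String, Integer>();
        for (int i = 0; i + n <= words.length; i++) {
            // Join the n words in the current window into a single gram.
            StringBuilder gram = new StringBuilder(words[i]);
            for (int j = i + 1; j < i + n; j++) {
                gram.append(' ').append(words[j]);
            }
            String key = gram.toString();
            Integer count = freq.get(key);
            freq.put(key, count == null ? 1 : count + 1);
        }
        return freq;
    }

    public static void main(String[] args) {
        // For n=2 and the string from the description above, this prints the
        // grams "a part", "part of" and "of distance", each with frequency 1.
        System.out.println(wordNgrams("a part of distance", 2));
    }
}

The map returned by wordNgrams corresponds to the frequency tables shown in Examples 1 and 2.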