School of Computer and Information Science
CIS Research Placement Report
User identification using n-gram analysis on command line histories
Sunee Holland
Date: 19/07/10
Supervisor: AsPr Helen Ashman
Abstract
A method for identifying authorship using n-gram analysis is presented. Custom software
has been built for the purpose of evaluating the effectiveness of this method. The n-gram
analysis technique is compared against one similar, commonly used method for authorship
identification: command frequency analysis. It was found that n-gram analysis can
successfully identify authorship. Additionally, it can reduce the chance of false positive
readings and increase clarity for interpreting results when compared to command
frequency analysis. Values for n were explored, and it is shown that a higher n can produce
more easily identifiable similarities between histories from the same author.
Table of Contents
1 Introduction
1.1 Motivation
1.2 Research Questions
1.3 Scope
2 Literature Review
2.1 N-gram based analysis
2.2 Anomaly detection using command line data
3 Methodology
3.1 Implementation
3.1.1 Software Architecture
3.1.2 History Analysis Algorithms
3.1.3 Application Inputs and Outputs
3.2 User Study
4 Results
4.1 Command Frequency Analysis
4.2 N-gram Analysis
4.2.1 N-gram values for n
5 Future Work
6 Conclusion
7 References
Appendix A
Appendix B
1 Introduction
N-grams are overlapping, n-sized segments of a series of letters, words, syllables,
phonemes or base pairs. An n-gram analysis can be conducted to establish a view of the
frequency of grams in a given sequence. The research presented focuses on using n-grams
to track the frequency of single characters or groups of characters. The innovative
contribution is the application of n-gram frequency analysis to command line histories for
identifying authorship. N-gram frequencies can be used in the context of behavioural
intrusion detection systems (IDS) to determine user authorship through a percentage
comparison of text, command histories in particular.
N-grams have a variety of applications, such as malicious executable detection (Kolter, JZ &
Maloof, MA 2004, Abou-Assaleh, T, Cercone, N, Kešelj, V & Sweidan, R 2004), and language
classification (Cavnar, WB & Trenkle, JM 1994). Current literature in n-gram analysis
extends early ideas that were based on text categorisation (Cavnar, WB & Trenkle, JM 1994)
and authorship (Houvardas, J & Stamatatos, E 2006, Peng, F, Schuurmans, D, Wang, S &
Keselj, V 2003, Soboroff, IM, Nicholas, CK, Kukla, JM & Ebert, DS 1997). Generally, these
methods take an input of text, upon which an n-gram analysis is performed. At the start of a
sequence, an n-sized segment is inspected, the frequency of that gram is stored, and the
inspected window slides towards the right. An example of a character n-gram analysis
where n=2 and the word is “regret” would be { “re”, ”eg”, ”gr”, ”re”, ”et” }; the frequency for
each gram would be “re”=2, “eg”=1, “gr”=1, and “et”=1. The result is an overall view of how
frequent n-sized segments are, and this information can be used to characterise a particular
user.
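The character n-gram tabulation described above can be sketched in Java, the language used for the software described in chapter 3. The class and method names here are illustrative only and do not reproduce the actual code in Appendix A.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of a character n-gram frequency count: slide an n-sized
// window across the input one character at a time and tally each gram.
public class NgramSketch {

    public static Map<String, Integer> frequencies(String text, int n) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            String gram = text.substring(i, i + n);
            // Increment an existing record, or create one with frequency 1.
            freq.merge(gram, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        // For "regret" with n=2 the tallies are re=2, eg=1, gr=1, et=1,
        // matching the worked example in the text.
        System.out.println(frequencies("regret", 2));
    }
}
```

Applied to "regret" with n=2, this yields the frequencies given above: "re"=2, "eg"=1, "gr"=1, "et"=1.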
This work on n-grams is intended to be used in the context of a behavioural IDS to
characterise a user based on their behaviour. Behavioural IDS are concerned with analysing
user behavioural patterns for the purpose of detecting anomalies (Balajinath, B & Raghavan,
SV 2001). The ability to characterise authorship based on behavioural patterns is pivotal to
the research presented in this report. This can be achieved by performing an n-gram
analysis, an approach currently absent from the literature, rather than by observing
command frequencies. N-gram analysis has the advantage of increased granularity and therefore
increased accuracy. Additionally, grams are variable in length, and therefore provide
increased flexibility.
Custom software will be created which aims to present a method that combines n-gram
analysis with authorship identification which could potentially lead to anomaly detection in
an IDS in the future.
1.1 Motivation
The primary motivation of authorship identification using n-gram analysis on command line
histories is the potential for increased authorship identification accuracy. Previously,
behavioural IDS models have adopted command frequency analysis for the purpose of
authorship identification (Lane, T & Brodley, CE 1999, Balajinath, B & Raghavan, SV 2001,
Maxion, RA 2003). This method is appropriate, however, it is possible to explore whether n-
gram analysis could improve the accuracy of authorship identification. N-gram analysis
takes the order of the data into account, and the inclusion of ordering should produce
a more unique characteristic, allowing individuals to be identified more accurately. The
value of placing n-grams in this context is further strengthened when combined with other
characteristic features such as keystroke dynamics in behavioural IDS, which could be
explored in future work.
Current literature lacks any combination of n-gram analysis on command line histories
within a behavioural IDS. There has been one application of n-grams in
behavioural IDS where they were used to analyse application data rather than to
behaviourally identify users (Kolter, JZ & Maloof, MA 2004). Additionally, there are many
behavioural IDS that use command frequency analysis on command line histories but none
that use n-gram analysis (Tan, K 1995, Lane, T & Brodley, CE 1999, Balajinath, B &
Raghavan, SV 2001, Maxion, RA 2003). Due to its absence in this context, n-gram analysis
has the potential to contribute in a behavioural IDS.
1.2 Research Questions
The research contribution is the use of n-grams to characterise users by analysing
command histories. One high level question sums up the overall research problem in a
generalised fashion:
What are the problems faced when investigating the use of n-grams to tabulate the
gram frequency of command histories?
The following questions crystallise the major issues that are addressed during the course
of the research.
Is it possible to uniquely identify users using n-gram analysis on command
histories?
What are the important differences between n-gram analysis and command
frequency analysis for the purpose of identifying authorship?
What are the possible values for n for achieving accurate results in the context of n-
gram analysis?
1.3 Scope
The research presented has a number of facets that will not be considered. Firstly, n-gram
analysis is implemented on command line histories without the behavioural IDS
environment. It is important to test the proof of concept: that n-gram analysis can increase
the accuracy of authorship identification. However, the purpose of developing such a
solution lies in its role within a behavioural IDS environment, so it would be worthwhile to
pursue this in future work. Additionally, in a behavioural IDS, the n-gram analysis could be
combined with other characterisation methods such as keystroke dynamics to improve the
uniqueness of authorship identification, but this is not covered here. Lastly, n-gram analysis
will be applied to command line histories only. N-gram analysis may have many other
applications for authorship or application data analysis; however, these will not be covered
in this research.
2 Literature Review
The following contains a review of literature relevant to n-gram analysis and authorship
identification. It forms the basis for understanding how the presented research is able to
innovate from these already established ideas. Using n-gram analysis on command line
histories is shown to be absent from literature and has the potential for increased detection
accuracy in a behavioural IDS. Command frequency analyses are commonly used for the
purpose of authorship identification. However, one improvement n-gram analysis has over
this method is its finer granularity of inspection and hence its ability to better
characterise authors. Additionally, there exist various implementations of behavioural IDS
that use command frequency analysis for the purpose of authorship identification but none
that use n-gram analysis. One application of n-grams in behavioural IDS was used to analyse
application data but not to identify users (Kolter, JZ & Maloof, MA 2004). The following
sections describe the literature of n-gram based analysis and anomaly detection using
command line data.
2.1 N-gram based analysis
An early method of n-gram analysis had the primary goal of classifying text based on n-gram
frequency statistics (Cavnar, WB & Trenkle, JM 1994). Compared to word frequencies, the
method of using n-grams was more appropriate for inspecting text closely and tolerating
textual errors. Text-based categorisation is useful for identifying languages in text and
was tested by being applied to Usenet newsgroup articles written in different languages. It
was also successfully tested for the ability to classify according to the subject of the text,
demonstrated by applying it to computer-oriented newsgroup articles. This research
was useful for forming the basis for classification using n-gram frequency analysis in a
variety of contexts.
Another earlier method of n-gram analysis was used to cluster documents based on usage
patterns (Soboroff, IM, Nicholas, CK, Kukla, JM & Ebert, DS 1997). Latent Semantic Indexing
is automatic topic-based indexing of documents and is used here for the purpose of
determining the writing style of documents. This was combined with n-gram analysis on
patterns of usage to determine the authorship of a document. The work also looked at finding an optimal value for
n for the length of grams and reported that having a smaller n is better at capturing style
characteristics as whole words occur more often. The main contribution was the
combination of n-gram analysis and Latent Semantic Indexing to classify the author of a
particular document.
Research continued to grow years later in the area of authorship using n-grams. The aim of
one particular application of this was the ability to predict the probability of naturally
occurring word sequences in any language using n-gram frequencies (Peng, F, Schuurmans,
D, Wang, S & Keselj, V 2003). This was used to categorise texts based on similarity. In the
experimental study, they rated performance as overall accuracy: the number of correctly
classified texts divided by the total number of texts classified. Similar to
previous research, the authors explored optimal values for n based on these accuracy
ratings. They found that a small n would not capture sufficient information and a large n
would create sparse data readings. Additionally, optimal n values were dependent on context. An n
of 3 for Greek and Chinese documents and an n of 6 for English documents were decided as
the values that gave the highest accuracy rating. As a result of testing, their simple n-gram
authorship algorithm was shown to improve on their previous text categorisation research.
The authors of another n-gram paper observed that viruses were typically found only after
they had caused damage and made themselves apparent. A solution was devised to reduce collateral damage
by providing a detailed characterisation algorithm that automatically detects malicious code
on the fly using byte level n-gram analysis (Abou-Assaleh, T, Cercone, N, Kešelj, V &
Sweidan, R 2004). To validate the usefulness of the algorithm, an experiment was
conducted that measured the rate of correct categorisation for malicious versus benign
code. Sample data included Windows executables extracted from email messages. The test
also included showing the accuracy for n with values ranging from 1 to 10. It concluded that
this n-gram method produced very high accuracy results, particularly when n was 3 or over.
A system called the Malicious Executable Classification system was developed for
commercial purposes; it detected malicious executables using n-grams
with byte codes as features (Kolter, JZ & Maloof, MA 2004). A combination of information
retrieval and text classification was used to achieve malicious executable detection. A pilot
study conducted aimed to determine the size of n-grams, size of words and number of
selected features. Using this information, an experiment was then performed using small
and large collections of executables, comparing true positive rates to false positive rates
to evaluate the performance of different algorithms; the authors found their application to
be successful.
In more recent times, n-gram analysis continues to be used for authorship identification.
Researchers found the value in automatically comparing variable length n-grams for
selecting appropriate features for authorship identification (Houvardas, J & Stamatatos, E
2006). They tested this concept by using 3-grams, 4-grams, and 5-grams and comparing
them to each other. The basis of the comparison was to find the most frequent grams over a
range of variable n values and use them to determine authorship. The results supported the
observation that this new method improved on other feature selection methods at the time.
None of these methods are used on command line histories. Some of these methods were
used to characterise authorship on different types of user data such as documents and
malicious code. Therefore, there is an opening for n-gram analysis on command histories
for authorship identification.
2.2 Anomaly detection using command line data
Early research was performed in using neural networks to characterise user behaviour in
the detection of security violations (Tan, K 1995). Security violations were found by
detecting anomalous user behaviour through patterns in characteristics such as user
activity times, user login hosts, user foreign hosts, command set, and CPU usage. The novel
contribution was the ability to adapt to changing patterns through neural networks that
quickly learn new patterns and slowly forget them. It was a notably useful method of
intrusion detection that is context tolerant, requires less processing, and consumes less
memory and disk space.
A method of anomaly detection focused on storage reduction when creating a user profile.
The aim was to behaviourally characterise using temporal sequences of discrete data (Lane,
T & Brodley, CE 1999). An experiment was conducted where they analysed UNIX shell
commands and used instance-based learning to generate profiles. They evaluated the
system by measuring the accuracy of detection and mean time to generate an alarm. The
results showed success, measured by the identification of malicious occurrences, with a
very low false positive rate.
An intrusion detection method uses a real time genetic learning algorithm as an approach to
developing a behaviour model for users (Balajinath, B & Raghavan, SV 2001). The system
takes command line data, referred to as user behaviour genes, as input. A fitness
function determines whether an anomaly has occurred or not. An experiment conducted
showed that this method was effective at characterising user behaviour.
A method of intrusion detection deals with masquerade attacks, in which one user
pretends to be another. It compares the novel method of information-enriched command
line data over truncated command line data for masquerade detection (Maxion, RA 2003).
Truncated data includes the command used, such as “cd”. Enriched command line data
includes entire commands, such as “cd <folder name>”. Test data included command line
histories from UNIX users. They found this method increased hits, reduced misses,
increased false alarms and reduced equal-basis cost of error.
These methods show that command line data was used for detecting anomalies in
behavioural IDS. None of these methods applied n-gram analysis to achieve authorship
identification or anomaly detection. As a result, there is ample opportunity to combine n-
gram analysis with authorship identification and behavioural IDS.
3 Methodology
The following chapter outlines the steps that were taken to design and build the software
and how it will be evaluated. The tangible outcomes are described including architecture,
algorithms, and inputs/outputs of the system.
3.1 Implementation
The implementation section contains the logical flow of the software produced. The
algorithms that were used, such as command frequency analysis and n-gram analysis, are
explained in detail. The inputs and outputs that the application produces are also explored.
3.1.1 Software Architecture
The following is an overview of the application flow. The application was written in Java.
There are two classes: the main “Ngram.java” and the user-defined data type “Data.java”.
With a Java-ready machine and the class files in the current directory, the application runs
with the command “java Ngram [n]”, where n is the command line argument for the size of
the segment. It then finds all command histories, which are files within that directory that
start with “history-“ and end with “.txt”. Next, it reads and parses each file, replacing line
breaks with spaces and removing any line numbers in the file. Each formatted input string is
tokenised into two separate HashMaps: one for file name association and one that points to
the user-defined data type Data, which contains the frequency of the commands as well as
the file name and the command line argument value of n. A CSV (Comma Separated Values)
file is prepared for shared frequency n-gram analysis, n-gram comparison, shared frequency
command analysis and command frequency comparison which are described in more detail
in subsection 3.1.3. They are placed in the “csv” folder and can be read by Microsoft Excel or
equivalent and are ready for analysis which is explicated in chapter 4. Appendix A contains
the Java code that performs this.
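As a rough illustration of the pre-processing step described above, the following sketch finds the “history-*.txt” files and strips the line-number prefix. It assumes, per section 3.1.3, that the line number occupies the first 4 characters of each line; the class and method names are hypothetical and do not reproduce the Appendix A code.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the history-file discovery and parsing step.
public class HistoryParser {

    // Strip the assumed 4-character line-number prefix from each line and
    // join the remaining commands into one space-separated string.
    public static String parse(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            String cmd = line.length() > 4 ? line.substring(4) : "";
            sb.append(cmd.trim()).append(' ');
        }
        return sb.toString().trim();
    }

    // Find all command histories in the given directory, matching the
    // "history-*.txt" naming convention described in the text.
    public static List<Path> findHistories(Path dir) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> ds =
                 Files.newDirectoryStream(dir, "history-*.txt")) {
            for (Path p : ds) files.add(p);
        }
        return files;
    }

    public static void main(String[] args) {
        // A two-line history with numbered entries becomes one flat string.
        System.out.println(parse(List.of("   1 ls -l", "   2 cd src")));
    }
}
```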
3.1.2 History Analysis Algorithms
A brief description of command frequency analysis is provided for contrast, which should
help form an understanding of how n-grams operate. This section then describes the
logical flow of character based n-gram analysis. For the purpose of the developed
application and therefore this research, characters are selected as the feature for n-grams,
which means the value of n will refer to the number of characters within a gram.
3.1.2.1 Command Frequency
Command frequency analysis is the tabulation of the frequency of commands, and in this
context is done on command histories. This has been implemented in the custom software
to provide a comparison with n-gram analysis which is used and later described in chapter
4. The following is a brief overview of how the command frequency analysis algorithm
operates. The first step takes an input, usually text. This text is tokenised into words and in
this way is similar to word-based n-gram analysis without the variable length. The
application checks a table that stores the frequency of grams. If the current command
matches a record in the table, the frequency is incremented. If the current command hasn’t
been recorded before, a new record is created and the frequency is set to 1. It then looks at
the next command in the sequence. This process repeats until there are no commands left.
Appendix B contains more information on word n-gram analysis, complete with examples,
which covers closely related principles.
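The command frequency algorithm above can be sketched as follows. As in the text, the history is treated as a sequence of whitespace-separated tokens (a word-based 1-gram); the class name is illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of command frequency tabulation: each token in the history is
// looked up in a frequency table, incremented if present, or inserted
// with frequency 1 if not, until no tokens remain.
public class CommandFrequency {

    public static Map<String, Integer> count(String history) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String token : history.trim().split("\\s+")) {
            freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        // "ls" appears twice, "cd" and "pwd" once each.
        System.out.println(count("ls cd ls pwd"));
    }
}
```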
3.1.2.2 Character N-grams
N-gram analysis is the tabulation of the frequency of variable-length grams and, as with
command frequency analysis, it is performed on command histories. The following is an
example based description of how a character n-gram analysis algorithm operates. The first
step takes an input, usually text, such as “a_part”. It then takes the value of n and uses it to
inspect the first n-sized character segment, which if n=1 it would be “a” or if n=2 it would be
“a_”. The application checks a table that stores the frequency of grams. If the current gram
matches a record in the table, the frequency of that gram is incremented. If the current gram
hasn’t been recorded before, a new record is created and the frequency is set to 1. The
window of inspection shifts to the right one position for the next iteration, which if n=1 it
would be “_“ or if n=2 it would be “_p”. This process repeats until there are no n-sized grams
left. The entire resulting set of substrings for n=1 would be “a”, “_“, “p”, “a”, “r”, “t”, or for n=2
would be “a_“, “_p”, “pa”, “ar”, “rt”. Example 1 below shows the flow of each iteration and
the values stored in the frequency table for an n-gram analysis where n=1 and the input is
“a_part”. Example 2 shows the same except with n=2.
Example 1: Logical flow of a character 1-gram
For the string “a_part”
1st iteration: “a_part” -> Frequency of the character “a” + 1
2nd iteration: “a_part” -> Frequency of the character “_“ + 1
3rd iteration: “a_part” -> Frequency of the character “p” + 1
4th iteration: “a_part” -> Frequency of the character “a” + 1
5th iteration: “a_part” -> Frequency of the character “r” + 1
6th iteration: “a_part” -> Frequency of the character “t” +1
Frequency table:
Gram Frequency
“a” 2
“_“ 1
“p” 1
“r” 1
“t” 1
Table 1—Frequency table for a character 1-gram for the string “a_part”.
Example 2: Logical flow of a character 2-gram
For the string “a_part”
1st iteration: “a_part” -> Frequency of the gram “a_” + 1
2nd iteration: “a_part” -> Frequency of the gram “_p“ + 1
3rd iteration: “a_part” -> Frequency of the gram “pa” + 1
4th iteration: “a_part” -> Frequency of the gram “ar” + 1
5th iteration: “a_part” -> Frequency of the gram “rt” + 1
Frequency table:
Gram Frequency
“a_” 1
“_p“ 1
“pa” 1
“ar” 1
“rt” 1
Table 2—Frequency table for a character 2-gram for the string “a_part”.
3.1.3 Application Inputs and Outputs
Inputs
Input for the application includes a command history file containing commands from a
single user over an arbitrary period of time. The application finds all files that start with
“history-“ and end with “.txt”. All line breaks are replaced with spaces and all line numbers,
which are the first 4 characters in a line, are replaced with spaces. This is done to ensure the
input data is uniform and not tainted by irrelevant data. One other input the application
takes is the value of n. This is the length per gram used in inspecting a command history file.
Optimal values for n are discussed in section 4.2.1.
Outputs
The application produces the results of frequency analyses in CSV files. These files each
contain column titles and row titles followed by the values for that particular row. They are
able to be interpreted by Microsoft Excel or equivalent and displayed in table format. The
following describe the outputs of the application:
N-gram frequency - Shows the entire contents of the frequency of each character
gram within a command history. Format of the file name is
<historyname>.ngram<n>.csv
Command frequency - Shows the entire contents of the frequency of each
command within a command history. Format of the file name is
<historyname>.cmdfreq.csv
Shared frequency n-gram analysis - Shows a comparison between two command
histories. It contains all the similar character grams and their frequencies side by
side. Format of the file name is
<history1name>.ngram<n>_vs_<history2name>.ngram<n>.csv
N-gram table - Shows the percentage of similarity of all the command histories
compared to each other using character n-gram analysis. Performed for each <n>.
Format of the file name is ngram<n>table.csv
Shared frequency command analysis - Shows a comparison between two
command histories. It contains all the similar commands and their frequencies side
by side. Format of the file name is
<history1name>.cmdfreq_vs_<history2name>.cmdfreq.csv
Command frequency table - Shows the percentage of similarity of all the command
histories compared to each other using command frequency analysis. Format of the
file name is cmdfreqtable.csv
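A minimal sketch of how one of these CSV outputs could be produced is shown below. The column titles and class name are assumptions, as the report does not list the exact headers used in its files.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of emitting a frequency table as CSV, in the
// "column titles followed by row values" shape described above.
public class CsvOutput {

    public static String toCsv(Map<String, Integer> freq) {
        // Assumed header names; the report does not specify them.
        StringBuilder sb = new StringBuilder("gram,frequency\n");
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            // Quote the gram so characters such as spaces survive in Excel.
            sb.append('"').append(e.getKey()).append("\",")
              .append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        freq.put("ls", 2);
        freq.put("cd", 1);
        System.out.print(toCsv(freq));
    }
}
```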
3.2 User Study
There were five participants involved in the study. These people were asked to extract and
provide the command history file of their nominated command line shell that supports the
command history feature. For example, this may be done in a BASH shell by using the
command “history > file name”, which redirects the output into a file. The files were of
arbitrary length depending on the time frame captured, which has a minimal impact on the
results; however, it is assumed that more data can create a more accurate characterisation. These
files were stripped of unnecessary information such as line numbering and surrounding
whitespace to ensure only user behaviour is being analysed. Some users were able to
provide multiple command histories as a result of using multiple computers, such as a
laptop. These files were then formatted to start with “history-“ and end in
“.txt” with the participant ID in the middle. These history files were then analysed using the
custom software.
4 Results
The outputs produced by the application as described in section 3.1.3 are the precursor for
the results described in this chapter. These results intend to show, firstly, that n-gram
analysis can successfully identify authors. A comparison of n-gram analysis and command
frequency analysis is shown to outline a more distinct rate of authorship similarity or
difference. Secondly, these results indicate whether or not there is a value or
values for n that will produce better results. This chapter begins with the results of
command frequency analysis including an overall percentage similarity between
participants, and individual comparisons on highly similar profiles to verify that they are
likely to be the same author. Then, n-gram analysis is described which also shows an overall
percentage similarity between participants, and individual comparisons between
participants. It ends with a discussion on optimal values for n.
4.1 Command Frequency Analysis
The analysis process requires the outputs produced by the application as these indicate
how well individual authors are characterised. The application produces a list of commands
produced by the command frequency analysis entitled “<historyname>.cmdfreq.csv”. These
files contain each command with the number of times that command appears in that
particular command history. Each of these files is collated into a table of overall percentages
of similarity for all the command histories compared to each other. The similarity dictates
the percentage of commands that are similar between two histories. This is stored in the file
“cmdfreqtable.csv” and its output is shown in the table below.
Command frequency table (% similarity)

                  P01a      P01b      P02       P03a      P03b      P04       P05
history-P01a.txt  100       12.72727  10.90909  10.90909  14.54545  16.36364  14.54545
history-P01b.txt  8.433735  100       10.84337  14.45783  19.27711  19.27711  8.433735
history-P02.txt   4.195804  6.293706  100       6.993007  9.79021   10.48951  4.195804
history-P03a.txt  2.419355  4.83871   4.032258  100       76.20968  6.451613  3.629032
history-P03b.txt  2.339181  4.678363  4.093567  55.26316  100       6.140351  3.216374
history-P04.txt   5.732484  10.19108  9.55414   10.19108  13.3758   100       5.732484
history-P05.txt   14.28571  12.5      10.71429  16.07143  19.64286  16.07143  100
Table 3—Table for similarity percentages for all histories compared to each other using command frequency analysis.
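One plausible reading of how these percentages could be computed is sketched below: the share of a row history's unique commands that also appear in the column history, which would explain why the table is not symmetric (e.g. P01a vs P01b differs from P01b vs P01a). The report does not state the exact formula, so treat this as an assumption; the class name is illustrative.

```java
import java.util.HashSet;
import java.util.Set;

// Hedged sketch of an asymmetric similarity percentage between two
// command histories, each represented as its set of unique commands.
public class Similarity {

    // Percentage of history a's unique commands that also appear in b.
    public static double percentShared(Set<String> a, Set<String> b) {
        if (a.isEmpty()) return 0.0;
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);  // keep only commands present in both histories
        return 100.0 * shared.size() / a.size();
    }

    public static void main(String[] args) {
        Set<String> p1 = Set.of("ls", "cd", "make", "vim");
        Set<String> p2 = Set.of("ls", "cd", "git");
        // 2 of p1's 4 commands also appear in p2.
        System.out.println(percentShared(p1, p2));
    }
}
```

Comparing a history to itself gives 100, matching the table's diagonal.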
From this table, we can see that every history is compared to itself with 100% similarity.
Most other comparisons do not have a discernible difference between each other. One
exception is the percentage of similarity between histories P03a and P03b, which is
abnormally high. This is explained by the fact that they are histories from the same author
(P03). This is evidence that command frequency analysis can uniquely identify authors,
which has already been established in literature.
Histories from P01 (P01a and P01b) have a low similarity percentage, which is opposite to
what was expected. The type of usage for that particular participant was explained to be
quite different. Each command history was produced on a different computer with different
purposes, and the low similarity percentage therefore makes sense.
One other point to note from the above table is that some comparisons have higher
similarity percentages despite a lack of discernible similarities. The problem lies in
accurately interpreting high similarity percentages, which increases the chance of false
positives occurring. This is primarily caused by command frequency analysis looking only
at commands and ignoring their ordering. N-gram analysis aims to solve this by analysing
same-sized blocks and considering the order of the data more stringently.
An additional step of the analysis is a further comparison of histories with high similarity
percentages. This is necessary because the similarity percentage is based only on shared
commands rather than on similar command frequencies; a high similarity percentage alone
therefore cannot identify authorship. The application produces additional CSV files giving a
side-by-side comparison of the shared commands for each pair of histories, named
"<history1name>.cmdfreq_vs_<history2name>.cmdfreq.csv".
This was done with the command histories “history-P03a.txt” and “history-P03b.txt” which
are stored in the file “history-P03a.txt.cmdfreq_vs_history-P03b.txt.cmdfreq.csv”. It was
opened in Microsoft Excel and a bar graph was created with unique commands on the x axis
and frequency on the y axis. The colours for the graph were carefully chosen in an attempt
to enhance visibility. The black background was specifically for outlining the foreground
colours. “History-P03a.txt” is defined by a dark blue and “history-P03b.txt” is defined by a
grey. These colours are transparent and overlapping so that command history similarity can
be shown by the combination of the two colours, which is light blue. The graph from P03a
versus P03b data is shown below.
[Bar chart "P03a vs P03b command frequency analysis": commands on the x axis, frequency (0–180) on the y axis.]
Figure 1—Comparison of similar commands for P03a vs. P03b using command frequency analysis.
From this graph one can see that the data is quite sparse, but there is a similarity between
the two frequency profiles, shown in light blue. Since the frequencies of shared commands
overlap, it suggests that the two histories were probably produced by the same author.
These results corroborate the existing literature: command frequency analysis can
successfully characterise authorship. There is, however, potential for improvement, which is
detailed in the following section.
4.2 N-gram Analysis
The n-gram analysis process starts in the same way as command frequency analysis. The
application tabulates the frequency of n-grams, writing one file per history named
"<historyname>.ngram<n>.csv". One such file is produced for each value of <n> supplied
through the command line arguments; for this experiment the values of n were 1, 2, 3, 4, 5,
7, 10 and 15. Each file lists every gram together with the number of times it appears in that
particular command history. These files are then collated into a table of overall similarity
percentages for all command histories compared to each other, where the similarity is the
percentage of grams shared between two histories. A separate table is produced for each n,
stored in the file "ngram<n>table.csv". An example output is shown for a 15-gram analysis,
which was chosen because it gave discernible results.
15-gram frequency table
Columns (left to right): history-P01a.txt, history-P01b.txt, history-P02.txt, history-P03a.txt, history-P03b.txt, history-P04.txt, history-P05.txt
history-P01a.txt 100 0.533333 0.533333 0 0 1.2 0
history-P01b.txt 0.1999 100 0.649675 0.09995 0.09995 0.349825 0
history-P02.txt 0.105932 0.34428 100 0.026483 0.370763 1.006356 0
history-P03a.txt 0 0.03581 0.017905 100 66.05192 0.07162 0
history-P03b.txt 0 0.022452 0.157162 41.41221 100 0.157162 0
history-P04.txt 0.261248 0.203193 1.103048 0.11611 0.406386 100 0
history-P05.txt 0 0 0 0 0 0 100
Table 4—Similarity percentages for all histories compared to each other using 15-gram analysis.
Again, we can see that every history compared with itself gives 100% similarity. The
same-author comparison (P03a and P03b) is now very distinct, and dissimilar comparisons
have very low similarity percentages. N-gram analysis considers the ordering of commands
and processes the entire document rather than a single line at a time. Not only can n-gram
analysis uniquely identify authors; it is shown to reduce the chance of false positive
readings.
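The percentages in Table 4 come from a shared-keys measure: the proportion of grams in one history that also occur in the other (see the sharedKeys method in Appendix A). A minimal sketch, using invented example maps:

```java
import java.util.*;

// How the similarity percentages in the tables are derived:
// the share of grams in history i that also occur in history j.
public class SimilarityTable {
    static double sharedKeys(Map<String, Integer> i, Map<String, Integer> j) {
        int count = 0;
        for (String key : i.keySet()) {
            if (j.containsKey(key)) count++;
        }
        return 100.0 * count / i.size();
    }

    public static void main(String[] args) {
        Map<String, Integer> a = Map.of("ls -l", 3, "cd /tmp", 1, "vim x", 2);
        Map<String, Integer> b = Map.of("ls -l", 1, "make", 4);
        // One of a's three grams appears in b: roughly 33.3%.
        System.out.println(sharedKeys(a, b));
        // One of b's two grams appears in a: 50%.
        System.out.println(sharedKeys(b, a)); // 50.0
    }
}
```

Note the measure is asymmetric, since each row is normalised by its own gram count; this would explain why Table 4 reports 66.05 for P03a versus P03b but 41.41 in the opposite direction.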
As with command frequency analysis, a further comparison of histories with high similarity
percentages is required. The application produces additional CSV files giving a side-by-side
comparison of the shared grams for each pair of histories at each n, named
"<history1name>.ngram<n>_vs_<history2name>.ngram<n>.csv". This was done with the
same command histories "history-P03a.txt" and "history-P03b.txt", whose comparison is
stored in the file "history-P03a.txt.ngram15_vs_history-P03b.txt.ngram15.csv". The 15-gram
analysis was chosen for consistency with the table above. The file was opened in Microsoft
Excel and a bar graph was created with unique grams on the x axis and frequency on the y
axis, using the same colour scheme as Figure 1: a black background to outline the
foreground colours, dark blue for "history-P03a.txt", grey for "history-P03b.txt", and light
blue where the two transparent colours overlap, showing command history similarity. The
graph from the P03a versus P03b data is shown below.
[Bar chart "P03a vs P03b 15-gram analysis": commands on the x axis, frequency (0–45) on the y axis.]
Figure 2—Comparison of similar grams for P03a vs. P03b using 15-gram analysis.
Compared to the command frequency graph in Figure 1, the command data is no longer
sparse, and the similarity between the two frequency profiles is even more noticeable with
n-gram analysis. Again, the frequencies of shared grams overlap, suggesting that the two
histories were probably produced by the same author. The following graph applies the same
method to two dissimilar command histories (P02 versus P04), offered as a contrast to the
same-author comparison in Figure 2.
[Bar chart "P02 vs P04 15-gram analysis": commands on the x axis, frequency (0–90) on the y axis.]
Figure 3—Comparison of similar grams for P02 vs. P04 using 15-gram analysis.
Figure 3 shows a clear difference between this different-author comparison and the
same-author comparison of Figure 2: shared grams are sparser and fewer in number, and
more dark blue and grey bars (grams unique to one history) are evident than in Figure 2.
Side by side, these graphs indicate that n-gram analysis can successfully identify authorship.
It is hence shown that n-gram analysis can substitute for command frequency analysis in
methods that identify authorship. The two methods show similar results; however, n-gram
analysis can reduce false positive readings by providing more accurate and clearer results.
N-gram analysis also benefits from a variable gram length, which can be tailored to suit a
particular purpose, as described in the following section.
4.2.1 N-gram values for n
N-gram analysis inspects n-sized grams while processing text. Variable-length grams are
beneficial because they allow the algorithm to be tuned to identify users more accurately.
N-gram analysis was performed with n = 1, 2, 3, 4, 5, 7, 10 and 15; however, tables are
provided only for n = 1, 3, 5, 10 and 15 to avoid needless repetition. They are compared
below, starting with the 1-gram frequency table.
1-gram frequency table
Columns (left to right): history-P01a.txt, history-P01b.txt, history-P02.txt, history-P03a.txt, history-P03b.txt, history-P04.txt, history-P05.txt
history-P01a.txt 100 71.92982 94.73684 91.22807 98.24561 89.47368 80.70175
history-P01b.txt 78.84615 100 96.15385 96.15385 100 90.38462 82.69231
history-P02.txt 77.14286 71.42857 100 87.14286 95.71429 88.57143 77.14286
history-P03a.txt 77.61194 74.62687 91.04478 100 100 86.56716 79.10448
history-P03b.txt 71.79487 66.66667 85.89744 85.89744 100 83.33333 74.35897
history-P04.txt 68 62.66667 82.66667 77.33333 86.66667 100 74.66667
history-P05.txt 76.66667 71.66667 90 88.33333 96.66667 93.33333 100
Table 5—Similarity percentages for all histories compared to each other using 1-gram analysis.
As expected, all the comparisons have quite high similarity percentages, which reduces the
ability to differentiate authors and creates a high chance of false positive readings. The
following table shows a 3-gram analysis.
3-gram frequency table
Columns (left to right): history-P01a.txt, history-P01b.txt, history-P02.txt, history-P03a.txt, history-P03b.txt, history-P04.txt, history-P05.txt
history-P01a.txt 100 19.67593 17.82407 26.62037 32.63889 23.37963 17.36111
history-P01b.txt 12.72455 100 20.20958 32.78443 40.26946 24.8503 12.72455
history-P02.txt 7.077206 12.40809 100 24.08088 31.61765 22.33456 8.180147
history-P03a.txt 7.871321 14.98973 17.93292 100 86.51608 18.61739 10.74606
history-P03b.txt 7.186544 13.7105 17.53313 64.42406 100 18.24669 10.34659
history-P04.txt 10.28513 16.90428 24.74542 27.69857 36.45621 100 10.28513
history-P05.txt 17.77251 20.14218 21.09005 37.20379 48.10427 23.93365 100
Table 6—Similarity percentages for all histories compared to each other using 3-gram analysis.
The same-author comparison (P03a and P03b) is starting to stand out but, as with command
frequency analysis, there is still a relatively high chance of false positive readings. The
following table shows a 5-gram analysis.
5-gram frequency table
Columns (left to right): history-P01a.txt, history-P01b.txt, history-P02.txt, history-P03a.txt, history-P03b.txt, history-P04.txt, history-P05.txt
history-P01a.txt 100 7.21831 5.809859 6.514085 7.570423 7.922535 6.338028
history-P01b.txt 4.120603 100 4.723618 9.849246 13.46734 7.839196 5.125628
history-P02.txt 1.98915 2.833032 100 6.871609 10.72936 8.438819 2.049427
history-P03a.txt 1.45612 3.856749 4.486423 100 79.53562 4.565132 4.28965
history-P03b.txt 1.182293 3.684355 4.894144 55.56778 100 5.389057 3.876822
history-P04.txt 2.853519 4.9461 8.877616 7.355739 12.42866 100 3.360812
history-P05.txt 6.196213 8.777969 5.851979 18.76076 24.2685 9.122203 100
Table 7—Similarity percentages for all histories compared to each other using 5-gram analysis.
5-gram analysis shows improvement over the 3-gram and 1-gram analyses but still shares a
problem with command frequency analysis: some comparisons have higher similarity
percentages than expected, so a chance of false positives remains. The following table shows
a 10-gram analysis.
10-gram frequency table
Columns (left to right): history-P01a.txt, history-P01b.txt, history-P02.txt, history-P03a.txt, history-P03b.txt, history-P04.txt, history-P05.txt
history-P01a.txt 100 2.832861 1.699717 0.424929 0.141643 3.399433 0.991501
history-P01b.txt 1.244555 100 1.1201 1.369011 1.493466 2.240199 0.746733
history-P02.txt 0.42523 0.637845 100 0.496102 1.311127 2.445074 0.070872
history-P03a.txt 0.068043 0.498979 0.317532 100 70.65094 0.635065 0.181447
history-P03b.txt 0.014747 0.35393 0.545642 45.93718 100 0.943814 0.221206
history-P04.txt 0.889878 1.334816 2.558398 1.038191 2.373007 100 0.667408
history-P05.txt 0.832342 1.426873 0.237812 0.951249 1.783591 2.140309 100
Table 8—Similarity percentages for all histories compared to each other using 10-gram analysis.
10-gram analysis shows a vast improvement over the command frequency analysis results,
with a low chance of false positive readings. The following table shows a 15-gram analysis.
15-gram frequency table
Columns (left to right): history-P01a.txt, history-P01b.txt, history-P02.txt, history-P03a.txt, history-P03b.txt, history-P04.txt, history-P05.txt
history-P01a.txt 100 0.533333 0.533333 0 0 1.2 0
history-P01b.txt 0.1999 100 0.649675 0.09995 0.09995 0.349825 0
history-P02.txt 0.105932 0.34428 100 0.026483 0.370763 1.006356 0
history-P03a.txt 0 0.03581 0.017905 100 66.05192 0.07162 0
history-P03b.txt 0 0.022452 0.157162 41.41221 100 0.157162 0
history-P04.txt 0.261248 0.203193 1.103048 0.11611 0.406386 100 0
history-P05.txt 0 0 0 0 0 0 100
Table 9—Similarity percentages for all histories compared to each other using 15-gram analysis.
15-gram analysis now shows unique authors most definitively. Strong similarities are
required to identify authorship, and dissimilar comparisons have low percentages, as they
should. The false positive rate should be decreased significantly.
A trend is seen each time the value of n increases: significant similarities between command
histories become clearer, which reduces false positive readings. While this data suggests
that a higher n will produce better results, it must be noted that an n exceeding the
command history length will reduce accuracy, as noted in previous literature (Peng,
Schuurmans, Wang & Keselj 2003). The variable length of n makes n-gram analysis flexible
and customisable, and is therefore an additional improvement.
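This trend can be reproduced in miniature with character n-grams, as in the ngram method of Appendix A. The toy strings below are invented for illustration; at small n nearly everything overlaps, while at larger n only text from the same source keeps shared grams.

```java
import java.util.*;

// Sketch: similarity across several n for same-source vs unrelated text.
public class NTrend {
    // Frequency map of character n-grams in s.
    static Map<String, Integer> charGrams(String s, int n) {
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i <= s.length() - n; i++) {
            map.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        return map;
    }

    // Percentage of a's grams that also occur in b.
    static double similarity(String a, String b, int n) {
        Map<String, Integer> ga = charGrams(a, n), gb = charGrams(b, n);
        int shared = 0;
        for (String g : ga.keySet()) if (gb.containsKey(g)) shared++;
        return 100.0 * shared / ga.size();
    }

    public static void main(String[] args) {
        String same1 = "cd work; make clean; make; ./run test";
        String same2 = "cd work; make; ./run test; make clean"; // same commands, reordered
        String other = "ls -la /var/log | grep error | less";   // unrelated usage
        for (int n : new int[]{1, 3, 5, 10}) {
            System.out.printf("n=%2d  same-source=%5.1f  different=%5.1f%n",
                    n, similarity(same1, same2, n), similarity(same1, other, n));
        }
    }
}
```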
5 Future Work
There are several opportunities for continuing this work in ways that have not yet been
explored. The main future application is integration with a behavioural IDS; live analysis of
incoming keystrokes would also be an interesting and useful endeavour. One flaw in the
current design is that similarity is based only on shared grams and does not take the
frequency of grams into account. If the initial analysis process considered gram frequencies,
it could remove the extra step of comparing shared grams side by side. Furthermore,
optimisation of the application code would be useful, as the analysis process takes time to
complete, and running an analysis on a larger number of command histories at once would
further exacerbate this problem. Other facets that could be explored include whitespace
trimming, case sensitivity, and algorithm optimisation for better results.
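As a purely hypothetical sketch of the frequency-aware similarity suggested above (not part of the current application), the shared frequency mass between two gram tables could be computed directly during the initial analysis, with invented example data:

```java
import java.util.*;

// Hypothetical: fold gram frequencies into the similarity score itself,
// removing the need for a separate side-by-side comparison step.
public class FreqSimilarity {
    // Sum of min(freq_i, freq_j) over shared grams, as a percentage of
    // history i's total gram count.
    static double freqSimilarity(Map<String, Integer> i, Map<String, Integer> j) {
        int shared = 0, total = 0;
        for (Map.Entry<String, Integer> e : i.entrySet()) {
            total += e.getValue();
            Integer other = j.get(e.getKey());
            if (other != null) shared += Math.min(e.getValue(), other);
        }
        return 100.0 * shared / total;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = Map.of("ls -l", 4, "make", 2);
        Map<String, Integer> b = Map.of("ls -l", 1, "make", 2, "vim", 5);
        // min(4,1) + min(2,2) = 3 shared out of 6 total -> 50.0
        System.out.println(freqSimilarity(a, b));
    }
}
```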
6 Conclusion
Empirical data has shown that n-gram analysis can produce more accurate and easily
recognisable results than command frequency analysis for authorship identification. An
experiment was conducted which used custom software to perform an analysis on
command line histories with n-gram analysis and command frequency analysis. These
methods both tabulated the frequency of every command or n-gram. The result was used to
form a table of the similarity percentage for all histories compared to each other, which in
turn was used to gauge the effectiveness of command frequency analysis versus n-gram
analysis. N-gram analysis is also shown to have the advantage of flexibility due to its
variable gram length. Generally, increasing the gram length makes same-author histories
more identifiable and reduces the chance of false positive readings. N-gram analysis could
be provided as an alternative algorithm to command frequency analysis in a behavioural
IDS, as the two have closely similar functionality; however, that remains to be explored.
7 References
Abou-Assaleh, T, Cercone, N, Kešelj, V & Sweidan, R 2004, 'N-Gram-Based Detection of New
Malicious Code'.
Balajinath, B & Raghavan, SV 2001, 'Intrusion detection through learning behavior model',
Computer Communications, vol. 24, no. 12, pp. 1202-1212.
Cavnar, WB & Trenkle, JM 1994, 'N-Gram-Based Text Categorization', Proceedings of SDAIR-
94, 3rd Annual Symposium on Document Analysis and Information Retrieval, New York,
USA.
Houvardas, J & Stamatatos, E 2006, 'N-Gram Feature Selection for Authorship
Identification', 12th International Conference on Artificial Intelligence : Methodology,
Systems, and Applications, Varna, Bulgaria.
Kolter, JZ & Maloof, MA 2004, 'Learning to detect malicious executables in the wild',
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery
and data mining, Seattle, WA, USA.
Lane, T & Brodley, CE 1999, 'Temporal sequence learning and data reduction for anomaly
detection', ACM Trans. Inf. Syst. Secur., vol. 2, no. 3, pp. 295-331.
Maxion, RA 2003, 'Masquerade detection using enriched command lines', Dependable
Systems and Networks, 2003. Proceedings.
Peng, F, Schuurmans, D, Wang, S & Keselj, V 2003, 'Language independent authorship
attribution using character level language models', Proceedings of the tenth conference on
European chapter of the Association for Computational Linguistics - Volume 1, Budapest,
Hungary.
Soboroff, IM, Nicholas, CK, Kukla, JM & Ebert, DS 1997, 'Visualizing document authorship
using n-grams and latent semantic indexing', Proceedings of the 1997 workshop on New
paradigms in information visualization and manipulation, Las Vegas, Nevada, United States.
Tan, K 1995, 'The application of neural networks to UNIX computer security', Neural
Networks, 1995. Proceedings., IEEE International Conference.
Appendix A
Data.java

import java.util.*;

// Holds the analysis results for a single command history file.
public class Data {
    public String name;
    public HashMap<String, Integer> ngrams;
    public HashMap<String, Integer> commandFreq;
    public int ngram;

    public Data(String name, HashMap<String, Integer> ngrams, HashMap<String, Integer> commandFreq, int ngram) {
        this.name = name;
        this.ngrams = ngrams;
        this.commandFreq = commandFreq;
        this.ngram = ngram;
    }
}
Ngram.java

import java.io.*;
import java.util.*;

public class Ngram {

    private static final boolean DEBUG = false;
    private static ArrayList<Data> data = new ArrayList<Data>();

    public static void main(String[] args) {
        // Exactly one argument: the gram size n
        if (args.length != 1) {
            System.out.println("Usage: java Ngram [n]");
            return;
        }
        int n = 0;
        try {
            n = Integer.parseInt(args[0]);
        }
        catch (NumberFormatException nfe) {
            System.out.println("Could not parse n because it is not an Integer.");
            return;
        }
        // Ensure the output directory exists before any CSV is written
        new File("csv").mkdirs();
        // Find all history files: history-*.txt
        FilenameFilter filenameFilter = new FilenameFilter() {
            public boolean accept(File dir, String name) {
                return name.endsWith(".txt") && name.startsWith("history-");
            }
        };
        File currDir = new File(".");
        String[] filenames = currDir.list(filenameFilter);
        // Populate n-grams and command frequencies for every history
        for (int i = 0; i < filenames.length; i++) {
            String fileName = filenames[i];
            String fileContents = readFile(fileName);
            // Original contents (line breaks become spaces)
            String orig = fileContents.replaceAll("\n", " ");
            orig = orig.replaceAll("\r", "");
            // Strip the line numbers that the history command prefixes to each entry
            String noLn = orig.replaceAll("[ ]+[0-9]{1,4}[ ]{2}", " ");
            HashMap<String, Integer> ngramMap = ngram(noLn, n, fileName);
            HashMap<String, Integer> commandFreqMap = commandFrequencies(noLn, fileName);
            data.add(new Data(fileName, ngramMap, commandFreqMap, n));
        }
        // N-gram similarity table
        String analysisCsv = "\"\",";
        for (int i = 0; i < data.size(); i++) {
            analysisCsv += "\"" + data.get(i).name + "\"";
            if (i != data.size() - 1)
                analysisCsv += ",";
        }
        analysisCsv += "\r\n";
        for (int i = 0; i < data.size(); i++) {
            analysisCsv += "\"" + data.get(i).name + "\",";
            for (int j = 0; j < data.size(); j++) {
                analysisCsv += "\"" + sharedKeys(data.get(i).ngrams, data.get(j).ngrams) + "\"";
                sharedFreqNgramAnalysis(data.get(i), data.get(j));
                if (j != data.size() - 1)
                    analysisCsv += ",";
            }
            analysisCsv += "\r\n";
        }
        writeFile("csv" + "/" + "ngram" + n + "table.csv", analysisCsv);
        // Command frequency similarity table
        analysisCsv = "\"\",";
        for (int i = 0; i < data.size(); i++) {
            analysisCsv += "\"" + data.get(i).name + "\"";
            if (i != data.size() - 1)
                analysisCsv += ",";
        }
        analysisCsv += "\r\n";
        for (int i = 0; i < data.size(); i++) {
            analysisCsv += "\"" + data.get(i).name + "\",";
            for (int j = 0; j < data.size(); j++) {
                analysisCsv += "\"" + sharedKeys(data.get(i).commandFreq, data.get(j).commandFreq) + "\"";
                sharedFreqCmdAnalysis(data.get(i), data.get(j));
                if (j != data.size() - 1)
                    analysisCsv += ",";
            }
            analysisCsv += "\r\n";
        }
        writeFile("csv" + "/" + "cmdfreqtable.csv", analysisCsv);
    }

    // Percentage of keys in i that also appear in j (asymmetric: normalised by i)
    private static double sharedKeys(HashMap<String, Integer> i, HashMap<String, Integer> j) {
        int count = 0;
        for (String key : i.keySet()) {
            if (j.containsKey(key))
                count++;
        }
        return ((double)count / (double)i.size()) * 100.0;
    }

    // Side-by-side CSV of the n-gram frequencies shared by two histories
    private static void sharedFreqNgramAnalysis(Data i, Data j) {
        String csv = "\"n-gram\",\"" + i.name + "\",\"" + j.name + "\"\r\n";
        for (String key : i.ngrams.keySet()) {
            if (j.ngrams.containsKey(key)) {
                // Escape embedded quotes and force Excel to treat the gram as text
                String keyParsed = key.replaceAll("\"", "\"\"");
                keyParsed = "=\"\"" + keyParsed + "\"\"";
                csv += "\"" + keyParsed + "\",\"" + i.ngrams.get(key) + "\",\"" + j.ngrams.get(key) + "\"\r\n";
            }
        }
        writeFile("csv" + "/" + i.name + ".ngram" + i.ngram + "_vs_" + j.name + ".ngram" + j.ngram + ".csv", csv);
    }

    // Side-by-side CSV of the command frequencies shared by two histories
    private static void sharedFreqCmdAnalysis(Data i, Data j) {
        String csv = "\"n-gram\",\"" + i.name + "\",\"" + j.name + "\"\r\n";
        for (String key : i.commandFreq.keySet()) {
            if (j.commandFreq.containsKey(key)) {
                String keyParsed = key.replaceAll("\"", "\"\"");
                keyParsed = "=\"\"" + keyParsed + "\"\"";
                csv += "\"" + keyParsed + "\",\"" + i.commandFreq.get(key) + "\",\"" + j.commandFreq.get(key) + "\"\r\n";
            }
        }
        writeFile("csv" + "/" + i.name + ".cmdfreq" + "_vs_" + j.name + ".cmdfreq.csv", csv);
    }

    // Tally the frequency of each whitespace-separated command/argument
    private static HashMap<String, Integer> commandFrequencies(String fileContents, String fileName) {
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        String csv = "\"Command/Argument\",\"Frequency\"\r\n";
        String[] rawSplit = fileContents.split("(\"[(^\")\\s]*\")|([\\s]+)");
        for (int i = 0; i < rawSplit.length; i++) {
            if (!map.containsKey(rawSplit[i])) {
                map.put(rawSplit[i], 1);
            }
            else {
                int count = map.get(rawSplit[i]);
                map.put(rawSplit[i], ++count);
            }
        }
        for (String key : map.keySet()) {
            if (DEBUG)
                System.out.println(key + "\t" + map.get(key));
            String keyParsed = key.replaceAll("\"", "\"\"");
            keyParsed = "=\"\"" + keyParsed + "\"\"";
            csv += "\"" + keyParsed + "\",\"" + map.get(key) + "\"\r\n";
        }
        writeFile("csv" + "/" + fileName + ".cmdfreq.csv", csv);
        return map;
    }

    // Tally the frequency of each character n-gram in the text
    private static HashMap<String, Integer> ngram(String fileContents, int n, String fileName) {
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        String csv = "\"n-gram\",\"Frequency\"\r\n";
        for (int i = 0; i <= fileContents.length() - n; i++) {
            String gram = fileContents.substring(i, i + n);
            if (!map.containsKey(gram)) {
                map.put(gram, 1);
            }
            else {
                int count = map.get(gram);
                map.put(gram, ++count);
            }
        }
        for (String key : map.keySet()) {
            if (DEBUG)
                System.out.println(key + "\t" + map.get(key));
            String keyParsed = key.replaceAll("\"", "\"\"");
            keyParsed = "=\"\"" + keyParsed + "\"\"";
            csv += "\"" + keyParsed + "\",\"" + map.get(key) + "\"\r\n";
        }
        writeFile("csv" + "/" + fileName + ".ngram" + n + ".csv", csv);
        return map;
    }

    private static String readFile(String fileName) {
        FileReader fr = null;
        BufferedReader br = null;
        String s = "";
        try {
            fr = new FileReader(fileName);
            br = new BufferedReader(fr);
            String line = null;
            while ((line = br.readLine()) != null) {
                s += line + "\n";
            }
        }
        catch (IOException ioe) {
            System.out.println("Unable to read file: " + fileName);
        }
        finally {
            try {
                if (br != null)
                    br.close();
                if (fr != null)
                    fr.close();
            }
            catch (IOException ioe) { }
        }
        return s;
    }

    private static void writeFile(String fileName, String fileContents) {
        FileWriter fw = null;
        try {
            fw = new FileWriter(fileName);
            fw.write(fileContents);
        }
        catch (IOException ioe) {
            System.out.println("Unable to write file: " + fileName);
        }
        finally {
            try {
                if (fw != null)
                    fw.close();
            }
            catch (IOException ioe) { }
        }
    }
}
Appendix B
The following is an example-based description of how a word n-gram analysis algorithm
operates. The first step takes an input, usually text, such as "a part of distance". The
algorithm then takes the value of n and inspects the first n-sized word segment, which for
n=1 would be "a", or for n=2, "a part". The application checks a table that stores the
frequency of grams. If the current gram matches a record in the table, the frequency of that
gram is incremented; if the current gram has not been recorded before, a new record is
created with a frequency of 1. The window of inspection then shifts one position to the right
for the next iteration, which for n=1 would be "part", or for n=2, "part of". This process
repeats until there are no n-sized grams left. The entire resulting set of grams for n=1 would
be "a", "part", "of", "distance"; for n=2 it would be "a part", "part of", "of distance".
Example 1: Logical flow of a word 1-gram
For the string "a part"
1st iteration: "a part" -> Frequency of the word "a" + 1
2nd iteration: "a part" -> Frequency of the word "part" + 1
Frequency table:
Gram Frequency
"a" 1
"part" 1
Example 2: Logical flow of a word 2-gram
For the string "a part"
1st iteration: "a part" -> Frequency of the gram "a part" + 1
Frequency table:
Gram Frequency
“a part” 1
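The worked examples above can be sketched directly in Java. This is a minimal illustration of the described algorithm, not the report's application code:

```java
import java.util.*;

// Sliding-window word n-gram frequency count, as described in Appendix B.
public class WordNgram {
    static Map<String, Integer> wordGrams(String text, int n) {
        String[] words = text.split("\\s+");
        Map<String, Integer> freq = new LinkedHashMap<>();
        // Slide an n-word window one position at a time.
        for (int i = 0; i <= words.length - n; i++) {
            String gram = String.join(" ", Arrays.copyOfRange(words, i, i + n));
            freq.merge(gram, 1, Integer::sum); // new record, or increment existing
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(wordGrams("a part of distance", 1));
        // {a=1, part=1, of=1, distance=1}
        System.out.println(wordGrams("a part of distance", 2));
        // {a part=1, part of=1, of distance=1}
    }
}
```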