NEW OUTLIER DETECTION TECHNIQUES FOR DATA STREAMS

Approved by:

Dr. Michael Hahsler
Dr. Margaret H. Dunham
Dr. Sukumaran Nair
Dr. Jeff Tian
Dr. Ping Gui

NEW OUTLIER DETECTION TECHNIQUES FOR DATA STREAMS

A Dissertation Presented to the Graduate Faculty of the
Bobby B. Lyle School of Engineering
Southern Methodist University
in Partial Fulfillment of the Requirements
for the degree of
Doctor of Philosophy
with a Major in Computer Science
by
Charlie Isaksson
(M.S.C.S., Mid Sweden University, 2006)
December 17, 2016

ACKNOWLEDGMENTS

I am truly humbled and grateful for the great number of individuals who have supported and encouraged me over the past nine years to fulfill my biggest dream. Dr. Margaret H. Dunham and Dr. Michael Hahsler have been my two mentors and friends throughout this rewarding journey. I would like to extend special thanks to Dr. Hahsler for helping me find my path back, and I recognize his background knowledge and patience.

I would like to extend my gratitude to the faculty and staff members in the Department of Computer Science and Engineering at Southern Methodist University.

To the other members of my dissertation committee, Dr. Sukumaran Nair, Dr. Jeff Tian, and Dr. Ping Gui: thank you for all the feedback and the patience you had with me. I wholeheartedly enjoyed the challenge of researching a critical issue that is currently important for various industries.

Finally, a special recognition goes out to my family and friends who supported and encouraged me during my pursuit of the doctorate in computer science. Thanks to my kids for giving me the strength to keep going. I love you more than you will ever know.

Isaksson, Charlie                                M.S.C.S., Mid Sweden University, 2006

New Outlier Detection Techniques for Data Streams

Advisor: Professor Michael Hahsler
Doctor of Philosophy degree conferred December 17, 2016
Dissertation completed November 9, 2016

The availability and reliability of data have become essential in our modern society. In fact, it has become critical in every domain to maintain high-quality data, even though that data may originate at high velocity and in large quantities. Today it is well understood that data enables businesses to achieve their full potential by providing valuable insights into their business as well as potentially offering them an advantage over their competitors. Achieving such a goal requires a significant investment in both big data infrastructure and data mining capabilities. Data mining is the process of finding hidden patterns within a large dataset. Imperative to data mining is the ability to detect outliers, data points that deviate from the rest of the data, because outliers can dramatically alter the result of the analysis. Although outliers occur infrequently, they are hard to identify, since there are many potential sources of outliers (such as human errors, machine errors, environmental variations, and faulty sensors). Finding outliers in a large dataset requires extremely efficient outlier detection techniques. It becomes even harder to detect an outlier within a data stream, as a stream imposes a single-pass restriction and data often arrives at a very fast rate. Also, streaming data may contain redundant information, which can reduce outlier detection performance and efficiency. To avoid this redundancy while maintaining the correctness of the data, it becomes necessary to summarize the data stream. The Extensible Markov Model (EMM) has proven to be a good candidate for meeting these requirements to detect outliers in data stream applications. EMM uses data stream clustering models and takes into account temporal and ordering aspects using a Markov Chain (MC), a powerful temporal model that allows studying a complex system and making predictions about events. The Extensible Markov Model is a time-varying MC that has the ability to learn and dynamically adapt its structure to the environment, as well as to update the state transition probabilities based on the incoming data. The model generated by EMM allows analysis of a particular time frame as an MC, and, as time passes, this model continues to adapt, evolve, and learn with the ongoing data stream. This is due to the close coupling of the clustering model with an MC model. Combining these two models delivers a spatiotemporal model that satisfies all the requirements from a data stream (big data) infrastructure standpoint. In this dissertation, the data pattern finding capability of EMM is extended in several ways. First, a sophisticated mining task on the synopsis is investigated to detect Distributed Denial of Service (DDoS) network intrusions. A performance study is then conducted of different outlier detection techniques compared with EMM, and this leads to two additional extensions that further improve EMM's performance. SOStream, a new self-organizing cluster structure that allows the algorithm to obtain the threshold for each micro-cluster dynamically, is proposed, and then SOStream is extended by integrating a Markov Model (MM). The new algorithm is called the Adaptive Streaming Markov Model (ASMM), which is designed to handle concept drift, spatiotemporal outliers, and high volume and velocity data streams while preserving high accuracy and cluster quality. The dissertation concludes with directions for future work, including distributed ASMMs that can be integrated into big data frameworks, ASMMs for telecom applications, and a visualization technique for multidimensional data that is greatly needed for better interpretation of outlier models.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES

CHAPTER

1. INTRODUCTION
   1.1. Motivation
   1.2. Focus of the Dissertation and Conclusions
   1.3. Organization of the Dissertation

2. BACKGROUND
   2.1. Outlier Detection
   2.2. Data Streams
   2.3. Outlier Detection in Data Streams
   2.4. Spatiotemporal Outlier Detection in Data Streams

3. RISK LEVELING OF NETWORK TRAFFIC ANOMALIES
   3.1. Introduction
   3.2. Related Work
   3.3. Methodology
   3.4. Experiments and Analysis
        3.4.1. Dataset
        3.4.2. Experiments
   3.5. Chapter Summary

4. A COMPARATIVE STUDY OF OUTLIER DETECTION ALGORITHMS
   4.1. Introduction
        4.1.1. Extensible Markov Model
        4.1.2. Density Based Local Outliers (LOF Approach)
        4.1.3. Density Based Local Outliers (LSC-Mine Approach)
   4.2. Experimental Results
        4.2.1. Time Analysis
        4.2.2. Experiments on Real Life Data and Synthetic Datasets
   4.3. Chapter Summary

5. SOSTREAM: SELF-ORGANIZING DENSITY-BASED CLUSTERING OVER DATA STREAM
   5.1. Introduction
   5.2. Related Work
   5.3. Methodology
        5.3.1. SOStream Overview
        5.3.2. Density-Based Centroid
        5.3.3. SOStream Algorithm
        5.3.4. Online Merging
   5.4. Experiments
        5.4.1. Synthetic Data
        5.4.2. Real-World Dataset
        5.4.3. Parameter Analysis
        5.4.4. Scalability and Complexity of SOStream
   5.5. Chapter Summary

6. ASMM: DETECTING SPATIO-TEMPORAL OUTLIERS WITH ADAPTIVE STREAMING MARKOV MODEL
   6.1. Introduction
   6.2. Methodology
        6.2.1. Extensible Markov Model Algorithm
        6.2.2. Adaptive Streaming Markov Model Algorithm
        6.2.3. EMMRare
   6.3. Experiments
        6.3.1. Datasets
        6.3.2. Parameter Analysis
        6.3.3. Scalability and Complexity of ASMM
   6.4. Chapter Summary

7. CONCLUSION
   7.1. Summary of Results
   7.2. Directions for Future Research

APPENDIX

REFERENCES

LIST OF FIGURES

2.1 A classification of outlier detection techniques
2.2 A workflow from a traditional spatiotemporal outlier detection framework
2.3 The workflow from the EMM outlier detection framework
3.1 Logarithm of traffic volume shows the DDoS attacks
4.1 Advantages of the LOF approach. Modified from [78]
4.2 Run time for LOF, LSC-Mine, and EMM with MinPts = 20 and EMM threshold = 0.99
5.1 (a) Data points of a stream with 5 overlapping clusters and (b) SOStream's capability to distinguish overlapped clusters. For visualizing cluster structure, we do not utilize fading or merging
5.2 SOStream clustering quality with horizon = 1K and stream speed = 1K. The quality evaluation for MR-Stream and D-Stream is retrieved from [68]
5.3 SOStream memory cost over the length of the data stream. The memory evaluation for MR-Stream is retrieved from [68]
5.4 SOStream execution time using the high-dimensional KDD CUP'99 dataset with 34 numerical attributes. The data sampling rate is every 25K points
6.1 Example of an EMM directed graph
6.2 Basic example that shows the high-level operations of ASMM
6.3 The sensors were arranged in the lab according to the diagram. Obtained from [88]
6.4 Subplots from time period [8398:9000]. It is evident that humidity suffers from a spatial outlier; however, due to the large data size we are unable to display temporal outliers
6.5 Subplots from the normalized Server System Health dataset. The highlighted red area includes both the spatial and temporal outliers
6.6 Distribution of ASMM's cluster count based on different buffer sizes for the KDD CUP'99 data
6.7 Cluster size decreases with increased buffer size. The number of clusters stabilizes between buffer sizes 15 to 35
6.8 ASMM and EMM memory cost over different threshold values using the KDD CUP'99 dataset
6.9 ASMM and EMM execution time using the high-dimensional KDD CUP'99 dataset

LIST OF TABLES

3.1 Notations of EMM elements
3.2 The extracted features from raw tcpdump data using the tcptrace software
3.3 Legend used in the performance evaluation with derivations from the confusion matrix
3.4 Impacts of clustering thresholds and selection of similarity measures
3.5 Detection rate and false alarm rate using the frequency-based anomaly detection model
3.6 Detection rate and false alarm rate using the risk leveling anomaly detection model
4.1 EMM detection and false positive rates
4.2 LOF detection and false positive rates
4.3 LSC-Mine detection and false positive rates
4.4 EMM, LOF and LSC-Mine detection and false positive rates using PCA
4.5 EMM, LOF and LSC-Mine detection and false positive rates
5.1 Feature comparison between different data stream clustering algorithms
5.2 Comparing average purity for different MinPts for α = 0.1
5.3 Comparing average purity for different MinPts for α = 0.3
5.4 Improvement of SOStream compared to MR-Stream and D-Stream
6.1 Legend from the confusion matrix
6.2 ASMM's outlier detection results over different threshold values and different measures from the confusion matrix
6.3 EMM's outlier detection results over different threshold values and different measures from the confusion matrix
6.4 LOF's outlier detection results over different threshold values and different measures from the confusion matrix
6.5 ASMM's outlier detection results over different threshold values and different measures from the confusion matrix
6.6 EMM's outlier detection results over different threshold values and different measures from the confusion matrix
6.7 LOF's outlier detection results over different threshold values and different measures from the confusion matrix
6.8 ASMM's outlier detection results over different threshold values and different measures from the confusion matrix
6.9 EMM's outlier detection results over different threshold values and different measures from the confusion matrix
6.10 LOF's outlier detection results over different threshold values and different measures from the confusion matrix

Dedicated to the Almighty Creator, the Most Gracious, the Most Merciful.

Chapter 1

    INTRODUCTION

Availability and reliability of data have become crucial factors in today's modern society. One important task for any application domain is to detect abnormal data. Outlier detection is extensively used in a wide variety of applications, such as fraud detection in banking systems, intrusion detection in network security, unusual behavior in military surveillance, and the detection of tumors in MRI images. Outliers are defined as data points that occur very infrequently and/or lie far from the expected values. It is crucial to investigate outliers because they may contain valuable information regarding the process under investigation. One should inquire why such data points have occurred and whether similar points will continue to appear before deciding to remove them from the dataset prior to training models. Statisticians have researched the problem of outlier detection since the early nineteenth century [49]. Many techniques have been proposed for outlier detection; some are specifically designed to suit certain application domains, while others are more generic. The presence of outliers in data may carry important information. For example, an anomaly in digital photography may indicate that a terrorist is using steganography to hide messages in the low-order bits of a digital photograph, in either plaintext or ciphertext form, to disguise it from their enemies [34]. Similarly, outliers in Magnetic Resonance Imaging (MRI) may identify pixels that are significantly different between two MRI scans and thereby indicate the presence of brain tumors [54]. Furthermore, an abnormal pattern in network traffic may signal an intrusion alarm, which may indicate that a compromised server is sending out unauthorized information [87]. Other examples are outliers in credit card transactions, which may draw attention to credit card theft [39], or interruptions in continuous signals from an airplane to the ground, where inconsistent data acting as outliers may lead to accidents.

Outliers may arise for several reasons, such as intrusion, human error, machine error, and changes in the behavior of the system. Because of these various causes, outliers are difficult to detect. For example, attempting to define a normal region that includes all possible behaviors is very problematic. Furthermore, it is difficult to set a precise boundary distinguishing outlier from normal behavior. This may result in a case where an outlier lying close to the boundary is predicted as normal or, on the contrary, normal data lying close to the boundary is identified as an outlier. There are also cases where malicious actions result in outliers. Malicious actors may try to adapt themselves in such a way that the resulting outliers appear normal, thereby making it difficult to distinguish between normal and malicious behavior. In addition, it becomes difficult to detect, distinguish, and remove data consisting of noise, which may appear similar to real outliers.

1.1. Motivation

Existing outlier detection techniques have been effective in either space or time, but not both. The Extensible Markov Model (EMM) [76] is a spatiotemporal algorithm that has been successfully used in diverse fields [77, 28, 120, 118, 119]. EMM has proven to be a powerful algorithm that can manage space and time very efficiently and adapt to continuous changes in the environment in a scalable manner. When EMM is used to process data, it dynamically constructs a codebook based on the input data. The codebook consists of a set of model vectors representing typical vectors within the dataset. EMM creates new entries in the codebook, based on a fixed threshold, when the input data does not map to an existing cluster. The use of a spatiotemporal data mining algorithm like EMM allows continuous assessment and is capable of both tracking changes over time and determining whether or not a particular change is probable based on a normal or abnormal pattern. Other outlier detection algorithms that are based only on clustering would be incapable of establishing such relationships because they lack the temporal model.

1.2. Focus of the Dissertation and Conclusions

This work presents several innovative data mining models for outlier detection based on the Extensible Markov Model (EMM) [76], which combines spatiotemporal data modeling with data streams. EMM is composed of two core features: modeling and pattern-finding capabilities. The modeling component in EMM is used to group related data points into clusters. EMM combines a clustering model with a Markov Model (MM). Several algorithms are proposed to extend EMM, and their performance is discussed for a large number of datasets.

    The main contributions of this work can be summarized as follows.

1. Risk leveling of network traffic anomalies: a real-world application used to explore sophisticated mining tasks. The false alarm rate is used for performance evaluation. Discussed in Chapter 3.

2. Comparison of EMM with other state-of-the-art outlier detection techniques, LOF and LSC-Mine. The comparison is based on accuracy and runtime complexity. The research indicates that EMM outperformed the other two techniques in several cases; however, EMM suffered from a critical issue concerning the clustering component's use of a fixed threshold. Discussed in Chapter 4.

3. For the EMM algorithm to work efficiently, it is imperative that the threshold be set to the correct value. The threshold is the static parameter that determines whether a new event belongs to an existing cluster or whether a new cluster should be created. If this parameter is set to an unsuitable value, the algorithm will either create too many clusters and suffer from overfitting, or create too few clusters, resulting in unstable classification. The next step is to add to EMM the ability to adapt its threshold. The Self-Organizing Map (SOM) [105] is an unsupervised algorithm that does not use a fixed threshold; instead, it creates an approximation of the output space with randomly assigned weights and, depending on the incoming data, adjusts the neighbors of the winning weight to be closer to it. SOStream is a proposed clustering algorithm that dynamically self-organizes its structure without the use of a fixed threshold. SOStream is designed specifically for data streams, so its performance was tested against two popular stream clustering algorithms, MR-Stream and D-Stream, on different real-world and synthetic datasets. SOStream outperformed the other two techniques concerning cluster purity, memory, and runtime complexity. SOStream can identify highly overlapping clusters, and SOStream's operations (i.e., create, remove, merge, and fade) are completely online. Discussed in Chapter 5.

4. EMM's two components, clustering and the MM, are tightly coupled, and data points have to be processed in order. We propose a new algorithm, ASMM, that utilizes an offline and an online component to decouple these two elements. ASMM can handle points arriving out of order while maintaining the original order of the incoming data points. An offline component based on SOStream initializes the clustering model, and the initial model is then efficiently incremented by the online component. The offline component uses a buffering technique to add support for concept drift as well as for new patterns that may emerge from an evolving stream. ASMM's performance is tested against two popular outlier detection algorithms, LOF and EMMRare, on a large number of real-world and synthetic datasets. ASMM outperformed the other two techniques in terms of different confusion matrix measures, memory, and runtime complexity. Discussed in Chapter 6.

1.3. Organization of the Dissertation

The remainder of the dissertation is organized as follows: Chapter 2 presents the background work; Chapter 3 proposes a spatiotemporal technique for outlier detection in data streams; Chapter 4 provides a comparative study of techniques to detect outliers; Chapter 5 presents a novel unsupervised clustering technique for data streams; Chapter 6 presents a spatiotemporal model that extends SOStream with a temporal Markov Model; and Chapter 7 concludes with an assessment of the viability of stream mining for outlier detection. Please note that each thesis chapter represents a previously published/submitted research paper; for this reason, some concepts are introduced recurrently in the introduction sections of various chapters.

Chapter 2

BACKGROUND

In this chapter, previous research related to spatiotemporal data stream mining is addressed. First, general techniques used in the area of outlier detection are reviewed, and then three important areas are discussed: data streams, outlier detection in data streams, and spatiotemporal outlier detection in data streams.

    2.1. Outlier Detection

[Figure 2.1: A classification of outlier detection techniques. The taxonomy covers clustering-based, nearest neighbor-based, statistical-based, classification-based (e.g., Support Vector Machine, Bayesian Network), and spectral decomposition-based (e.g., Principal Component Analysis) techniques.]

Different approaches and methodologies have been introduced to address the outlier/anomaly detection problem; they include statistical approaches, supervised and unsupervised learning techniques, neural networks, and machine learning techniques (see Figure 2.1). We cannot provide a complete survey here but refer the interested reader to available surveys [109], [71], [110]. We briefly mention some representative techniques.¹

¹ This section has been published in the International Conference on Machine Learning and Data Mining (MLDM), 2009 [26].

The Grubbs method (extreme studentized deviate) [50] is a one-dimensional statistical method in which all parameters are derived from the data; it requires no user parameters. It calculates the mean and standard deviation of all attribute values and then calculates a Z-score for a query as the difference between the mean value of the attribute and the query value, divided by the standard deviation of the attribute. The Z-score for the query is then compared with a threshold at the 1% or 5% significance level.
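A minimal sketch of the Grubbs-style score; the function name and the fixed numeric cutoff are illustrative stand-ins for the tabulated significance threshold in [50]:

```python
import numpy as np

def grubbs_flags(values, z_crit=1.96):
    # Studentized deviate: |x - mean| / std, compared to a cutoff that
    # stands in for the tabulated 1% or 5% significance threshold.
    values = np.asarray(values, dtype=float)
    z = np.abs(values - values.mean()) / values.std(ddof=1)
    return z > z_crit  # True marks a suspected outlier

print(grubbs_flags([9.8, 10.1, 10.0, 9.9, 25.0]))
```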

An optimized k-NN approach was introduced by [97]. It gives a list of potential outliers and their ranking. In this approach, the entire distance matrix needs to be calculated for all the points, but the authors introduced a partitioning technique to speed up the k-NN algorithm.
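To make the ranking idea concrete, here is a brute-force sketch of scoring points by their k-th nearest neighbor distance (it omits the partitioning speed-up of [97]; names and the toy data are illustrative):

```python
import numpy as np

def knn_outlier_scores(X, k=3):
    # Full pairwise distance matrix, as in the basic formulation;
    # [97] speeds this search up with a partitioning technique.
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d.sort(axis=1)   # column 0 is each point's distance to itself
    return d[:, k]   # distance to the k-th nearest neighbor

scores = knn_outlier_scores([[0, 0], [0, 1], [1, 0], [10, 10]], k=2)
print(np.argsort(scores)[::-1])  # indices ranked most- to least-outlying
```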

Other outlier detection approaches are based on neural networks. Neural networks are non-parametric models that require training and testing to determine the threshold used to identify outliers. Most of them suffer when the data has high dimensionality. [10] and [22] detect novelties in time-series data, for fault diagnosis in vibration signatures of aircraft engines and for monitoring processes such as oil pipeline flows, respectively. They both use a supervised neural network (multilayer perceptron), which is a feed-forward network with a single hidden layer; hidden layers add neurons to the network architecture to build up the ability to model highly complex nonlinear functions. The drawback is that an increased number of neurons also increases the time needed by the neural network to converge during learning. [83] uses an auto-associative neural network, which is also a feedforward perceptron-based network that uses supervised learning. [106] introduced a detection technique for time series monitoring based on the Adaptive Resonance Theory (ART) [51] incremental unsupervised neural network.

An approach that works well with high-dimensional data uses decision trees, as in [53] and [36], where a C4.5 decision tree detects outliers in categorical data to identify unexpected entries in databases. They pre-select cases using the taxonomy from a case-based retrieval algorithm to prune outliers and then use these cases to train the decision tree. [103], [104] introduced an approach that uses similarity-based matching for monitoring activities.

The Local Outlier Factor (LOF) [24] algorithm detects outliers by measuring the local deviation of a given data point with respect to its neighbors. LOF was designed for static data, but if applied repeatedly, either periodically or every time a new data point arrives, the algorithm can be adapted to data streams. Pokrajac [43] proposed an incremental LOF algorithm in which the reachability distance, local reachability density (LRD), and LOF values for each new data point are computed and those values for existing points are updated; hence, outliers can be detected instantly.
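As an illustration, scikit-learn ships a batch LOF implementation; re-running it on each window approximates the repeated application described above (the parameter values and toy data below are arbitrary):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [5.0, 5.0]])

# n_neighbors plays the role of MinPts in the LOF paper.
lof = LocalOutlierFactor(n_neighbors=2)
labels = lof.fit_predict(X)  # -1 marks detected outliers
print(labels, lof.negative_outlier_factor_)
```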

    2.2. Data Streams

The data mining community has provided many innovative technologies that address different issues. One of these is data streams, a newer data mining area that involves data that is continuous and perhaps infinite. This type of data can be characterized as high volume arriving at high velocity. Storing such data may be impractical, and even if such a data volume were stored, processing any particular record more than once may be infeasible. See [29] for a detailed discussion of different streaming applications. Additionally, the characteristics of streaming data may change over time (e.g., concept drift). Since data streams can be viewed as time series, time series models were also considered. Traditional linear time series models consist of three statistics-based models: autoregressive (AR), integrated (I), and moving average (MA). The ARIMA model [23] integrates all of these models. Data stream mining has a single-pass restriction that makes traditional time series models impractical. Thus, rather than using time series forecasting models to detect outliers, conventional multidimensional models that account for temporal drift and deviations are used.

    2.3. Outlier Detection in Data Streams

As new data arrives, data stream models need to update their structures to capture the normal trends in the data. Outliers are then detected when the data causes a drastic change to the original model. Yamanishi and Takeuchi [61, 63] presented an online sequential discounting algorithm that incrementally learns a probabilistic mixture model. The model accounts for drift by using a decay factor. Moreover, the model can detect outliers by computing an outlier score from the learned mixture model. Depending on the type of data (continuous or categorical), different models were proposed. For categorical data, Sequentially Discounting Laplace Estimation (SDLE) utilizes a Laplace smoothing function to compute a probability score based on the occurrence frequency of a particular symbol divided by the number of all data points; for every new data point, the model needs to update all of its cells. Two models were proposed for continuous data: Gaussian mixture and time series. Both models detect an anomaly if the model at time (t − 1) has changed after adding a new data point at time t. Further research by Javitz [56] proposed updating the normal distribution of the data by giving more weight to recent data. An accepted solution for streaming data is to model or summarize related data points into clusters, which helps avoid retaining the whole dataset. Clustering models in general can be used to detect outliers. This methodology fits new data points into existing clusters, and outliers are detected when either new data points do not fit into the clusters or the internal cluster structure changes. Several popular clustering algorithms that can be used for outlier detection are reviewed below. Aggarwal and Yu [30] proposed clustering as a method to detect outliers. The k-Means clustering method is often used because it allows reallocation of samples even after assignment and converges quickly. The problem with basic k-Means is that the random allocation of cluster centers reduces its accuracy. Also, the values of k (number of clusters) and t (number of iterations) are difficult to set in advance. To counter this limitation, dynamic clustering approaches were proposed. In fact, the underlying structure in data stream clustering continues to evolve as time passes. Detecting outliers, whether spatially or temporally, is particularly challenging. For instance, data points analyzed at an early stage can be incorrectly viewed as outliers; however, as time elapses a new trend may start to occur. Moreover, data points that are time delayed may also appear falsely as outliers. Thus, techniques such as dynamic time warping may help to discover the truth. Accordingly, the dynamic nature of data has motivated data mining researchers to develop innovative technologies to manage such requirements.
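A hedged approximation of the SDLE idea described above: every cell count is discounted at each step so recent symbols dominate, and Laplace smoothing keeps unseen symbols scorable. The decay rate and the negative-log-probability scoring are assumptions, not the published formulation:

```python
import math

class SDLESketch:
    def __init__(self, alphabet, r=0.01):
        # One cell per symbol; r is an assumed discounting rate.
        self.counts = {s: 0.0 for s in alphabet}
        self.r = r

    def update_and_score(self, symbol):
        for s in self.counts:          # discount every cell
            self.counts[s] *= 1.0 - self.r
        self.counts[symbol] += 1.0
        total = sum(self.counts.values())
        # Laplace smoothing (+1) keeps unseen symbols scorable.
        p = (self.counts[symbol] + 1.0) / (total + len(self.counts))
        return -math.log(p)            # high score = rare symbol

m = SDLESketch(alphabet="abc")
print([round(m.update_and_score(s), 3) for s in "aaab"])
```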

For example², E-Stream [64] handles the evolving data stream by providing cluster operations like add, delete, split, and merge. The algorithm starts empty, and at every time step, based on a radius threshold, either a new data point is mapped into one of the existing clusters or a new cluster is created around the incoming data point. Any cluster that does not meet a defined density level is considered inactive and remains isolated until achieving the desired weight. Cluster weights are decreased over time to reduce the influence of old data points. This technique is well known as a fading function, whereby a cluster that is inactive for a certain time period risks being deleted. Also, at each step a pair of clusters may be merged, either because the overlap between the two clusters is sufficiently large or because the maximum cluster limit has been reached. The split of one cluster into two sub-clusters occurs if the internal data is different. The split process creates one histogram for each active cluster, where each data dimension is summarized into an α-bin histogram, and the split is performed if a deep valley between two significant peaks is found.

² Some of these clustering algorithms are described in this section in more detail compared to the originally published paper [27].
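The fading function mentioned above is commonly realized as an exponential decay of the cluster weight with the time since its last assignment; the decay rate lam below is a hypothetical parameter, not a value from E-Stream:

```python
# Exponential fading of a cluster weight; lam is an assumed decay rate.
def faded_weight(weight, t_now, t_last, lam=0.1):
    return weight * 2.0 ** (-lam * (t_now - t_last))

w = faded_weight(weight=12.0, t_now=50, t_last=30)
print(w)  # a cluster may be deleted once its weight drops too low
```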

CluStream [4] divides the clustering process into online and offline components. The online micro-clustering component periodically stores detailed summary statistics from a high-speed data stream, and the offline macro-clustering component uses the summary statistics, in association with user input, to provide the user with a quick understanding of the clusters whenever required. This two-phase approach also provides the user with the flexibility to explore the nature of the evolution of the clusters over different time periods.

DenStream [25] discovers clusters of arbitrary shape in an evolving data stream by maintaining two lists, one with potential micro-clusters and the other with outlier micro-clusters. Each time a new data point arrives, an attempt is made to merge the point into the nearest existing potential micro-cluster. If the radius of the resulting micro-cluster is larger than a specified radius, the merge is omitted, and another attempt is made to merge the point with the nearest outlier micro-cluster. Once again, if the resulting radius is larger than the specified radius, the merge is omitted, and a new outlier micro-cluster centered at that point is created and added to the outlier list. If any of the outlier micro-clusters exceeds a specified weight, it is moved into the potential micro-cluster list. Periodically, outlier micro-clusters are re-examined and either promoted into the potential micro-cluster list or pruned.

OpticsStream [42] is an online visualization algorithm that produces a map representing the clustering structure. It adds the ordering technique from OPTICS [81], which by itself is not suitable for data streams, on top of any density-based algorithm such as DenStream to better manage the cluster dynamics.

HPStream [5] is an online clustering algorithm that discovers distinct clusters based on different subsets of the streaming data point dimensions. This is achieved by maintaining, for each cluster, a d-dimensional vector that indicates which of the dimensions are included in the continuous assignment of incoming streaming data points to an appropriate cluster. The algorithm begins by tentatively assigning a received streaming data point to each of the existing clusters; it then computes the radii, selects the dimensions with the smallest radii, and creates a d-dimensional vector for each cluster. Next, the Manhattan distance is computed from the incoming data point to the centroid of each existing cluster (where the cluster's d-dimensional vector limits the dimensions of its centroid). From these distances, the winner is found by returning the largest average distance along with the included dimensions. Then the radius is computed for the winning cluster and compared to the winning distance; based on this comparison, either a new fading cluster is created, centered at the incoming data point, or the incoming data point is added to the winning cluster. Also, clusters are removed if they contain zero dimensions or if the number of clusters exceeds the user-defined threshold.

WSTREAM [41] is a density-based algorithm that discovers cluster structure by maintaining a list of rectangular windows that are incrementally adjusted over time. Each window moves based on the centroid of its cluster, and the centroid is incrementally recomputed whenever new streaming data points are inserted into the window. The windows can also incrementally contract and expand based on the window's approximated kernel density and a user-defined bandwidth matrix controlled by specified rules. When windows overlap, the proportion of streaming data points in the intersection of the pair of windows to the remaining points in each window is computed and compared to user-defined thresholds, which determine whether to remove or merge the windows. The algorithm also periodically monitors the weights of the stored windows: if a weight is less than the defined minimum threshold (in which case the window is considered an outlier), or the window is very old compared to the defined time, the window is removed.

D-Stream [31] is a density-based clustering algorithm for data streams. The algorithm works on the same basis as the time step model. It starts by initializing an empty hash table grid list and contains both an online and an offline component. The online component reads each incoming raw data record and either maps it to an existing entry in the grid list or inserts a new entry if none exists. After the insertion of the record into the grid, the characteristic vector of the grid, which contains all the information about the grid, is updated. Thus, the online component partitions the data into many corresponding density grids, forming grid clusters. The offline component takes the role of dynamically adjusting the clusters. If a grid receives no new values for an extended period, it is removed from the grid list. Such grids are known as sporadic grids and may contain outliers.

MR-Stream [68] extends D-Stream by finding clusters at versatile granularities. It recursively partitions the data space into well-defined cells using a quadtree data structure. MR-Stream facilitates both online and offline components.

2.4. Spatiotemporal Outlier Detection in Data Streams

Spatiotemporal data mining refers to a process that extracts hidden knowledge from both the spatial and the temporal data space. Spatiotemporal mining is an emerging research area for data stream applications. Traditionally, data mining techniques considered spatiality and temporality as two separate research areas; today, however, their combination has become a central requirement for processing data events. According to the survey article on outlier detection by Manish [74], spatiotemporal outliers can be defined as spatiotemporal objects whose behavioral/thematic (non-spatial and non-temporal) attributes are significantly different from those of the other objects in their spatial and temporal neighborhoods. Figure 2.2 shows a workflow from a traditional spatiotemporal outlier detection framework. This framework divides outlier detection into three main components. The first component is responsible for finding objects from the input data stream that have interesting semantics. The next component analyzes these objects to identify whether they are spatial outliers. Finally, the spatial outliers are examined across time to check whether they are temporal outliers. Objects are classified as spatiotemporal outliers if found to be both spatial and temporal outliers.
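A skeleton of this three-step workflow (Figure 2.2); the three tests are trivial placeholders for whatever semantic, spatial, and temporal criteria a concrete framework plugs in:

```python
def find_spatial_objects(stream):
    # Placeholder: keep every record; a real framework applies
    # semantic filtering here.
    return list(stream)

def is_spatial_outlier(x, neighborhood, tol=3.0):
    # Placeholder spatial test: deviation from the neighborhood mean.
    mean = sum(neighborhood) / len(neighborhood)
    return abs(x - mean) > tol

def is_temporal_outlier(x, history, tol=3.0):
    # Placeholder temporal test: deviation from recent history.
    mean = sum(history) / len(history)
    return abs(x - mean) > tol

def spatiotemporal_outliers(stream, neighborhood, history):
    candidates = find_spatial_objects(stream)
    spatial = [x for x in candidates if is_spatial_outlier(x, neighborhood)]
    # Only the spatial outliers are then checked across time.
    return [x for x in spatial if is_temporal_outlier(x, history)]

print(spatiotemporal_outliers([1.0, 1.2, 9.5], [1.0, 1.1, 0.9], [1.0, 1.2]))
```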

[Figure 2.2: A workflow from a traditional spatiotemporal outlier detection framework: incoming spatiotemporal data flows through "find spatial objects", "find spatial outliers", and "verify temporal outliers" stages, yielding the spatiotemporal outliers.]

Birant [40], Cheng [101, 32], and Adam [3] proposed anomaly detection algorithms that utilize the multi-step approach in Figure 2.2. They first try to detect spatial outliers and then verify their temporal neighborhoods to determine the spatiotemporal outliers. The techniques use a modified version of DBSCAN [79] for both the spatial and the temporal neighbors: given a radius, a density factor is assigned to clusters intended to detect potential outliers, and the two evaluations are performed to identify the spatiotemporal outliers. Another technique, proposed by Cheng, uses spatial scaling with a four-step approach to address the semantic and dynamic properties of geographic phenomena for ST-outlier detection. First, the algorithm finds semantic (i.e., spatiotemporal) objects, using prior knowledge to form regions that have significant semantic meanings. Next, aggregation, which focuses on detecting spatiotemporal outliers, is utilized to remove noise. Additionally, the outliers found in the clustering phase are compared to the points that were filtered. A final step is to verify the temporal outliers based on the previous steps. Adam [3] uses a distance-based outlier detection technique, which establishes a spatial Voronoi grid to obtain macro-clusters. The algorithm uses the Jaccard distance and the silhouette coefficient to determine the quality of the micro-clusters. Any points that substantially deviate from the neighborhood are flagged as spatiotemporal outliers. Other techniques, such as outlier solids, the Kulldorff scan statistic, and trajectory outliers, are discussed in more detail in [73].

The Extensible Markov Model (EMM) [76] is a spatiotemporal algorithm based on first-order Markov Chains (MC), described in [17]. EMM consists of two parts: a distance-based data stream clustering algorithm for spatial data that obtains representative granules in the continuous data space, and an MC that models temporal behavior. EMM applies to data stream processing where the number of states is unknown in advance and provides a heuristic modeling method where the approximation of the Markov property is appropriate. EMM operations are entirely online and thus suitable for data streams. Figure 2.3 shows the EMM framework for detecting spatiotemporal outliers. The rest of this thesis will investigate solutions and improvements for different aspects of this framework.

[Figure 2.3: The workflow from the EMM outlier detection framework: spatiotemporal data points (Data_t, Data_(t+1), ..., Data_(t+n)) feed a clustering component coupled with a Markov model, whose combined output yields the spatiotemporal outliers.]

Chapter 3

    RISK LEVELING OF NETWORK TRAFFIC ANOMALIES

The goal of intrusion detection is to identify attempted or ongoing attacks on a computer system or network. Many attacks aim to compromise computer networks in an online manner, and traffic anomalies have been an important indication of such attacks. The challenges in detection lie in modeling the large continuous streams of data and performing anomaly detection in an online manner.

In this chapter¹, we present a data mining technique to assess the risks of local anomalies based on a synopsis obtained from a global spatiotemporal modeling approach. The proposed model is proactive in the detection of various types of traffic-related attacks such as distributed denial of service (DDoS). It is incremental and scalable, and thus suitable for online processing. Algorithm analysis shows the time efficiency of the proposed technique. The experiments conducted with a DARPA dataset demonstrate that, compared with a frequency-based anomaly detection model, the false alarm rate of the proposed model is significantly mitigated without losing a high detection rate.

¹ This work has been published in the International Journal of Computer Science and Network Security (IJCSNS), 2006 [28] and presents joint work with Yu Meng and Professor Margaret H. Dunham.

    3.1. Introduction

Data mining is used to detect anomalies [120] [8] [78] [52]. The goal of anomaly detection is to "find data objects that are different from most other objects" [86]. An anomaly can be used as an indication of a possible dangerous situation in computer networks and other systems. When an anomaly is detected by an anomaly detection model, an alarm is set, and human intervention is invoked to examine whether the alarm represents an event of interest such as a dangerous situation or a malicious activity. A traffic anomaly is a type of anomaly: it refers to traffic characteristics that deviate from those occurring the majority of the time. These behaviors may have a significant impact on the system. Traffic anomalies have received attention as a major indicator of risk exposure in computer networks. For example, Juniper Networks has proposed a combination of traffic anomaly detection, protocol anomaly detection, and stateful signatures to identify a variety of types of attacks in computer networks [91]. Cisco has delivered the Cisco Traffic Anomaly Detector XT 5600 for the detection of distributed denial of service (DDoS), worms, and other attacks [33]. Applications of traffic anomaly mining can be intuitively extended to highway traffic operation and electric power demand management. However, an anomaly is not necessarily a risk. Generally, as a higher detection rate is pursued with an anomaly detection model, a higher false alarm rate is caused as well. The human intervention required by false alarms is very costly, and there is a demand to reduce unnecessary intervention. Automatic techniques are desired to evaluate the chance that an anomaly is of interest, so as to screen out anomalies that are probably not a risk. Existing anomaly detection work uses either frequency-based or data deviation-based approaches [120] [78] [46] [96] [9] [85]. We have noticed that these may suffer from a high false alarm rate. In this chapter we propose a risk leveling model, a two-phase data mining technique with rules using both occurrence frequency and data deviations. The proposed model detects anomalies based on frequency and then measures the deviation of each anomaly away from the normal data space. The level of risk with which the anomaly is associated is evaluated by the deviation, as we envision that the anomaly data space, when risks occur, is located away from the normal data space. A common characteristic of a data stream is its high volume of data; moreover, the data continuously arrives at a rapid rate. It is not feasible to store all data from the streams and use random accesses to the data as we do in a traditional database. This implies a single-pass restriction for all data in the streams [55]. Therefore, the data stream must be modeled in order to obtain a synopsis of the global profile of the dataset. Data mining is a key technique in modeling stream data. Our proposed risk leveling model is built on the Extensible Markov Model (EMM), a spatiotemporal modeling technique [76]. The risk leveling model uses the synopsis obtained from the EMM modeling process. Performance comparisons with a frequency-based anomaly detection model [120] are expected to show a low false alarm rate without losing a high detection rate for the proposed risk leveling model. The proposed model also inherits the incrementality and scalability of EMM.
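To fix the two-phase idea, here is a heavily simplified sketch: phase one flags anomalies by low occurrence frequency, and phase two grades the risk by deviation from the normal data space. The threshold and the centroid-based deviation are illustrative assumptions, not the chapter's exact formulation:

```python
import numpy as np

def risk_level(point, cluster_center, cluster_freq, freq_thresh=0.05):
    # Phase 1: frequent events are considered normal (risk 0).
    if cluster_freq >= freq_thresh:
        return 0.0
    # Phase 2: grade the anomaly by its deviation from the normal
    # data space, here simplified to distance from a cluster center.
    return float(np.linalg.norm(np.asarray(point) - np.asarray(cluster_center)))

print(risk_level([4.0, 4.0], cluster_center=[1.0, 1.0], cluster_freq=0.01))
```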

    3.2. Related Work

Our proposed technique assesses the chance that alarms raised by a frequency-based anomaly detection model are actually events of interest, i.e., risks. Before we present the risk leveling model, we first introduce related work, followed by the frequency-based anomaly detection technique. Among the prominent properties of an anomaly are its rarity and possible significance. These properties distinguish anomaly detection techniques from modeling techniques in other subjects in terms of feature selection/construction and evaluation metrics. Lazarevic [8] indicates that unsupervised techniques and supervised techniques are the two major categories of techniques in anomaly detection. Unsupervised techniques are capable of mining unlabeled data; that is, no a priori knowledge is required for "normal" profiles. An anomaly is detected by selecting an event that deviates from the majority. Although a variety of algorithms can be applied, some common steps are as follows:

    • Construct features. The features may be constructed in a weighted numeric vector.

    • Determine a distance measure from the data point, which represents an event under

    investigation, to a cluster. The kth nearest neighbor distance [75], similarity (Jaccard,

    Cosine, Overlap, Dice [75]), Euclidean distance, Manhattan distance [75], skewed

    19

  • distance (Mahalanobis distance [8]), and density distance (LOF) [78] are of common

    distance measures.

• Apply an anomaly detection algorithm to the data, based on a set of rules defining anomalies. The following categories of anomaly detection algorithms are seen in the literature:

– Distance based algorithms [46] [96],

– Statistics based algorithms [84], including finite mixture models [62] and information theory [113],

– Model based algorithms such as neural networks [92] and SVM [9].
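To make these steps concrete, the following minimal sketch, ours and not taken from the cited works, scores each point by the distance to its kth nearest neighbor [75] and applies a simple threshold rule; the feature values, distance measure, threshold, and all names are illustrative assumptions.

    import numpy as np

    def knn_outlier_scores(X, k=3):
        """Score each point by the distance to its kth nearest neighbor."""
        # Pairwise Euclidean distances between all feature vectors.
        diff = X[:, None, :] - X[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=2))
        # Column 0 of each sorted row is the distance to the point itself (0),
        # so column k holds the kth nearest neighbor distance.
        return np.sort(dist, axis=1)[:, k]

    X = np.array([[1.0, 4.0], [1.1, 3.9], [2.0, 3.0],
                  [2.1, 3.1], [9.0, 9.0]])       # last point deviates
    scores = knn_outlier_scores(X, k=2)
    outliers = scores > 3.0                       # rule: threshold on the score
    print(outliers)                               # -> [False False False False  True]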

Distance-based algorithms are based on clustering and form a major category of techniques in anomaly detection. These techniques neither assume independence among different dimensions of the data, as statistics based algorithms do, nor are they as sensitive to the initial selection of the model as model based algorithms are. Their results are easy to interpret, and they are suitable for spatial data mining. Temporality can be another expected characteristic of anomalies in addition to spatiality, particularly in traffic anomaly detection. Markov chains and suffix trees have been used [44] [85] to store temporal profiles. The benefit of a Markov chain is its concise mathematical representation. Variations of the Markov chain with dynamic structures have been proposed to model dynamically changing data [76] [35]. The suffix tree stores all suffixes of a sequence and supports string matching against those suffixes in linear time. The EMM [76] takes advantage of distance-based clustering for spatial data as well as of the Markov chain for temporality. EMM achieves efficient modeling by mapping groups of closely located real world events to states of a Markov chain. EMM is an extension of the Markov chain. EMM uses clustering to obtain representative granules in the continuous data space. Also, by providing a dynamically adjustable structure, EMM is applicable to data stream processing when

the number of states is unknown in advance and provides a heuristic modeling method for

data that approximately satisfy the Markov property. EMM formalizes a framework for spatiotemporal data mining by introducing phases, namely clustering and Markov chain construction, which model the data stream so as to obtain a synopsis of the data profile, and applications, which are built on that synopsis. Below we give a concise description of EMM, which should be sufficient to grasp the scope of our work; further information concerning EMM can be found in [76]. A multidimensional data point in EMM represents a real world event. The data point can be represented as a vector in a hyperspace. EMM defines a set of formalized procedures such that at any time t, EMM consists of a Markov Chain (MC) and algorithms to modify it, where the algorithms include the following (a code sketch follows the list):

1. EMMCluster: defines a technique for matching between input data at time t + 1 and existing states in the MC at time t. This is a clustering algorithm which determines if the new data point or event should be added to an existing cluster (MC state) or whether a new cluster (MC state) should be created. A distance threshold th is used in clustering.

2. EMMBuild: is an algorithm that updates (as well as adds, deletes, and merges) the MC at time t + 1 given the MC at time t and the output of EMMCluster at time t + 1.

3. EMMapplications: are algorithms that use the EMM to solve various problems. To date, we have examined EMM for prediction (EMMPredict) [76] and anomaly (rare event) detection (EMMRare) [120].
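The following minimal sketch, written under our own simplifying assumptions rather than as the reference implementation of [76], shows how EMMCluster and EMMBuild can interleave: each incoming point is matched to the nearest state within the threshold th (or opens a new state), and the node and transition counts of the synopsis are updated. For simplicity, LS here stays at the first point assigned to a state; [107] uses a centroid or medoid.

    import numpy as np

    class EMM:
        """Minimal sketch of an EMM synopsis: states with counts (CN),
        representatives (LS), and transition counts (CL)."""

        def __init__(self, th):
            self.th = th          # clustering distance threshold
            self.LS = []          # representative of each state
            self.CN = []          # occurrence count of each state
            self.CL = {}          # (i, j) -> transition count
            self.current = None   # index of the current state

        def cluster(self, x):
            """EMMCluster: nearest existing state within th, else a new state."""
            x = np.asarray(x, dtype=float)
            if self.LS:
                d = [float(np.linalg.norm(x - c)) for c in self.LS]
                j = int(np.argmin(d))
                if d[j] <= self.th:
                    return j, False
            self.LS.append(x)
            self.CN.append(0)
            return len(self.LS) - 1, True

        def build(self, j):
            """EMMBuild: update state and transition counts, move to state j."""
            self.CN[j] += 1
            if self.current is not None:
                key = (self.current, j)
                self.CL[key] = self.CL.get(key, 0) + 1
            self.current = j

    emm = EMM(th=1.0)
    for x in [[1, 4], [2, 3], [2, 3], [1, 4], [2, 3]]:
        j, is_new = emm.cluster(x)
        emm.build(j)
    print(emm.CN, emm.CL)   # -> [2, 3] {(0, 1): 2, (1, 1): 1, (1, 0): 1}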

Throughout this chapter, we view EMM as a directed graph with nodes and links. We use link and transition interchangeably to refer to a directed arc, and node, state, and cluster interchangeably to refer to a vertex in the EMM. These algorithms are executed in an interleaved manner. The first two phases are

used to model the data. The third phase is used to perform applications based on the

synopsis created in the modeling process. The synopsis includes information about cluster features [107] and transitions between states. The cluster feature defined in [107] includes

    at least a count of occurrence, CNi (count on the node) and either a medoid or centroid

    for that cluster, LSi. To summarize, elements of the synopsis of an EMM are listed in

    Table 3.1.

Table 3.1: Notations of EMM Elements

Legend of Notations

Notation   Description
Ni         The ith EMM node, labeled by CNi and LSi
CNi        Count of occurrences of data points found in the cluster (EMM node or EMM state) Ni
LSi        A vector representing the representative data point of the cluster, usually the centroid or medoid of the cluster
Lij        The directed link from Ni to Nj, labeled by CLij
CLij       Count of occurrences of the directed link from Ni to Nj
m          Number of EMM states
n          Number of attributes in the vector representing a data point, or dimensions of the data space

In this chapter, the frequency based anomaly detection algorithm [120], which is one of the several applications of EMM, is used for comparison with the proposed model. We give a brief review of the approach. The idea for anomaly detection comes from the fact that the learning aspect of EMM dynamically creates a Markov chain that captures past behavior in the synopsis. No input into the model identifies normal or abnormal behavior; instead, this is learned from the statistics of occurrence of transitions and states within the generated Markov chain. By learning what is normal, the model can predict what is not. The basic idea is to define a set of rules related to the cardinalities of clusters and transitions to judge anomalies. An anomaly is detected if an input event (or data point), Et, is determined not to belong to any existing cluster (state in EMM), if the cardinality of the associated cluster (CNj) is small, or if the transition count (CLij) from the current state, i, to the new state, j, is small. When any of the predefined rules is met, a Boolean alarm, At,

is set to indicate the capture of an anomaly.
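As a hedged illustration of these rules (the threshold values and function name below are our own assumptions, not prescribed by [120]), the alarm At can be computed as:

    def emm_rare_alarm(is_new_state, CN_j, CL_ij, cn_min=2, cl_min=2):
        """Frequency based rules: alarm on a brand-new cluster, a rare
        cluster, or a rare transition into the new state."""
        return is_new_state or CN_j < cn_min or CL_ij < cl_min

    # e.g. a well-known cluster reached through a never-seen transition:
    print(emm_rare_alarm(is_new_state=False, CN_j=10, CL_ij=0))   # -> True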

    3.3. Methodology

In this section we present the steps to build the risk leveling model, based on EMM modeling [76] and the frequency based anomaly detection model [120], as well as the evaluation metrics. The KDD process defines data preprocessing procedures [108] that convert raw data into a format appropriate for data mining. Our preprocessed data use a structured format which combines the time stamp and spatial traffic statistics in one vector:

Vt = <Dt, Tt, S1t, S2t, ..., Sit, ...>,

where Dt denotes the type of day, Tt the time of day, and Sit the value of the statistic found at spatial location i at time t. This spatiotemporal format defines an input real world event (input data point) in the multidimensional data space. Assume there are n elements in the vector; each data point can therefore be represented as a vector in n-dimensional space.
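For example, one observation could be encoded as follows (the field values and the nine traffic statistics are purely illustrative):

    # V_t = <D_t, T_t, S_1t, ..., S_9t>: day type, time of day, traffic counts
    V_t = [1,       # D_t: type of day (e.g. 1 = weekday)
           9.5,     # T_t: time of the day, in hours
           120, 87, 3, 0, 45, 2, 1, 9, 14]  # S_it: counts at spatial locations
    n = len(V_t)    # dimensionality of the data space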

A trait of EMM is that it learns while performing a task, so it dynamically adapts to the time-variant dataset. To perform mining of risk levels, the following are applied:

    1. EMMCluster: Nearest neighbor clustering,

    2. EMMBuild,

    3. EMMAnomaly,

    4. EMMRiskLeveling.

Algorithms EMMCluster and EMMBuild define the modeling process of EMM [76]. Combined with algorithm EMMAnomaly [76], they define an EMM anomaly detection model based on occurrence frequency, which was introduced in the preceding section. The

anomaly detection model sets alarms, At, based on a set of predefined, frequency based

rules. To build a risk leveling model, a new algorithm, EMMRiskLeveling, is added. The risk leveling model outputs a risk leveling index by combining the frequency based anomaly alarm with an evaluation of the deviation of the local pattern from the normal data space. We will see that the deviation evaluation can be calculated incrementally.

To evaluate the deviation, we use two parameters, the centroid \vec{c}(t) and the diameter D(t), to characterize the data space Ω of the model. Here the data space Ω refers to the region that the data points occupy in the n-dimensional hyperspace. It is equivalent to the region over which the EMM nodes are distributed. The centroid of Ω is given in Definition 3.1. Moreover, the centroid can be computed incrementally. Using this incrementality, the time complexity is reduced from O(m) to O(1), as given in Lemma 3.1.

Definition 3.1 (Centroid of data space) Denote an EMM node by Ni, the number of data points included in the node by CNi, and the first moment or representative location of Ni by \vec{LS}_i. The centroid of the data space, \vec{c}(t), is defined as:

\vec{c}(t) = \frac{\sum_{i=1}^{m} \vec{LS}_i \, CN_i}{t} \qquad (3.1)

Lemma 3.1 (Incrementality of the centroid of the data space) Given \vec{c}(t-1) and the first moment \vec{LS}_c of the current EMM state, \vec{c}(t) can be expressed in an incremental manner:

\vec{c}(t) = \frac{\vec{c}(t-1)\,(t-1) + \vec{LS}_c}{t} = \vec{c}(t-1)\left(1 - \frac{1}{t}\right) + \frac{\vec{LS}_c}{t} \qquad (3.2)


Proof 3.1 First note that \vec{LS}_c is the same as \vec{LS}_t. We consider two cases:

1. N_c is a new EMM node:

\vec{c}(t) = \sum_{i=1}^{m(t)} \frac{\vec{LS}_i \, CN_i}{t}
= \sum_{i=1,\, i \neq c}^{m(t-1)} \frac{\vec{LS}_i \, CN_i}{t} + \frac{\vec{LS}_c}{t}
= \sum_{i=1,\, i \neq c}^{m(t-1)} \frac{\vec{LS}_i \, CN_i}{t-1} \cdot \frac{t-1}{t} + \frac{\vec{LS}_c}{t}
= \frac{\vec{c}(t-1)\,(t-1) + \vec{LS}_c}{t}
= \vec{c}(t-1)\left(1 - \frac{1}{t}\right) + \frac{\vec{LS}_c}{t}

2. N_c is an existing EMM node:

\vec{c}(t) = \sum_{i=1}^{m(t)} \frac{\vec{LS}_i \, CN_i}{t}
= \sum_{i=1,\, i \neq c}^{m(t)} \frac{\vec{LS}_i \, CN_i}{t} + \frac{\vec{LS}_c \, CN_c}{t}
= \sum_{i=1,\, i \neq c}^{m(t)} \frac{\vec{LS}_i \, CN_i}{t} + \frac{\vec{LS}_c \,(CN_c - 1)}{t} + \frac{\vec{LS}_c}{t}
= \sum_{i=1}^{m(t-1)} \frac{\vec{LS}_i \, CN_i}{t-1} \cdot \frac{t-1}{t} + \frac{\vec{LS}_c}{t}
= \frac{\vec{c}(t-1)\,(t-1) + \vec{LS}_c}{t}
= \vec{c}(t-1)\left(1 - \frac{1}{t}\right) + \frac{\vec{LS}_c}{t}

(In the second case, the regrouped sum is taken at time t - 1, where node c had count CN_c - 1.)
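A minimal sketch of this O(1) update, under the simplifying assumption that each arriving point contributes its own coordinates as \vec{LS}_c (the function name is ours):

    import numpy as np

    def update_centroid(c_prev, LS_c, t):
        """Incremental centroid of the data space (Lemma 3.1):
        c(t) = c(t-1) * (1 - 1/t) + LS_c / t, an O(1) update."""
        return c_prev * (1.0 - 1.0 / t) + LS_c / t

    # Reproduce c(5) from Example 3.1 below by feeding five data points:
    points = [np.array(p, dtype=float) for p in
              ([1, 4], [2, 3], [2, 3], [1, 4], [2, 3])]
    c = np.zeros(2)
    for t, p in enumerate(points, start=1):
        c = update_centroid(c, p, t)
    print(c)   # -> [1.6 3.4], i.e. <8/5, 17/5>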

Now we define the diameter of Ω in Definition 3.2.

Definition 3.2 (Diameter of data space) Denote an EMM node by Ni, the number of data points included in the node by CNi, and the distance between any two EMM nodes Ni and Nj by dij. The diameter of the data space at time t, D(t), is defined by:

D(t) = \left( \frac{\sum_{i=1}^{m} \sum_{j=1}^{m} d_{ij}^2 \, CN_i \, CN_j}{2\,t\,(t-1)} \right)^{1/2} \qquad (3.3)

where t is the time instance and m is the number of EMM nodes. For simplicity of computation, we define:

d(t) = \left( \frac{\sum_{i=1}^{m} \sum_{j=1}^{m} d_{ij}^2 \, CN_i \, CN_j}{2} \right)^{1/2} \qquad (3.4)

Therefore, we have:

D(t) = \left( \frac{d^2(t)}{t\,(t-1)} \right)^{1/2} \qquad (3.5)

or,

D^2(t) = \frac{d^2(t)}{t\,(t-1)} \qquad (3.6)

At each time instance, D(t) gives a weighted inter-cluster distance of the data points received so far and can be used to measure the size of the data space. It can be seen as an approximation of the inter-point distance in the data space that ignores the distances among data points within the same cluster. The computational complexity of this is O(m^2); however, given the incrementality below, it can be reduced to O(m).

Lemma 3.2 (Incrementality of the diameter of the data space) Given the diameter of the data space at time instance t - 1, the diameter of the data space at time instance t can be expressed in an incremental manner:

d^2(t) = d^2(t-1) + \sum_{i=1}^{m(t-1)} d_i^2(t) \, CN_i \qquad (3.7)

where d_i(t) denotes the distance from the data point arriving at time t to node Ni.

Since the proof is very similar to that of the incrementality of the centroid, we skip it here. Now denote the distance between the current node Nc and \vec{c}(t) at time instance t by dcc(t). We define a risk leveling index in Definition 3.3.

Definition 3.3 (Risk Leveling Index) Given an alert raised by the frequency based anomaly detection model when a data point \vec{E}_t is input, the risk leveling index caused by data deviation is given by a hyperbolic tangent sigmoid function, defined as:

a(t) = \frac{e^{r(t)} - e^{-r(t)}}{e^{r(t)} + e^{-r(t)}} \qquad (3.8)

where,

r(t) = \left( \frac{d_{cc}^2}{D^2(t)} \right)^{1/4} \qquad (3.9)

or, for simplicity of computation,

r(t) = \left( \frac{t\,(t-1)\, d_{cc}^2}{d^2(t)} \right)^{1/4} \qquad (3.10)

The index a(t) lies in the range [0, 1) because the ratio r(t) is never negative. The further the current data point lies outside the border of the data space, the more likely it is associated with a risk; this follows from our assumption. The procedure to compute the risk leveling index is illustrated in Algorithm 1.


input : At : Boolean output of the frequency based anomaly detection model at time t;
        Gt : EMM at time t
output: a(t) : risk leveling index at time t

foreach time instance t do
    if At == true then
        update \vec{c}(t) using (3.2);
        update D(t) using (3.7) and (3.5) or (3.6);
        compute a(t) using (3.10);

Algorithm 1: EMMRiskLevel
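A sketch of Algorithm 1's core computations under our own assumptions (plain Python, names ours); update_d2 follows Lemma 3.2, and risk_level_index combines (3.10) with (3.8):

    import numpy as np

    def update_d2(d2_prev, new_point, LS, CN):
        """Incremental d^2(t) per Lemma 3.2: add squared distances from the
        new point to each existing node, weighted by the node counts CN_i."""
        p = np.asarray(new_point, dtype=float)
        return d2_prev + sum(np.sum((p - np.asarray(ls)) ** 2) * cn
                             for ls, cn in zip(LS, CN))

    def risk_level_index(d2_cc, d2, t):
        """a(t) = tanh(r(t)) with r(t) = (t(t-1) d_cc^2 / d^2(t))^(1/4)."""
        r = (t * (t - 1) * d2_cc / d2) ** 0.25
        return np.tanh(r)

    # With the values of Example 3.1 at t = 6 (d^2(6) = 14, d_cc^2 = 2/5):
    print(risk_level_index(0.4, 14.0, 6))   # -> tanh((6/7)**0.25) = 0.745...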

Example 3.1 Given an EMM at time 5, specified as:

N1 = { 2, <1, 4> }, N2 = { 3, <2, 3> },
L11 = { 1 }; L12 = { 1 }; L21 = { 1 }; L22 = { 2 };
\vec{c}(5) = <8/5, 17/5>,
d^2(5) = 12.

Our proposed approach to determining the risk leveling index based on the synopsis has the following benefits:

• Computation takes O(1) time for \vec{c}(t) and O(m) time for D(t). Recall that EMM takes O(m) time for clustering and O(1) time for Markov chain updates. The proposed approach thus inherits the time efficiency of EMM.

• The proposed approach is based solely on the synopsis of the EMM obtained at the current time. Thus the proposed method is as incremental and scalable as EMM.

• The proposed approach learns in an unsupervised manner while performing applications. It is not heavily dependent on a training process and thus is suitable for stream data processing.

Continuing Example 3.1, if a data point \vec{d}_6 = <1, 3> is input at time t = 6, is clustered into a new EMM node N3, and the frequency based anomaly model sets an alarm At = true due to its rules, then using Algorithm 1 we have:

\vec{c}(6) = \left\langle \frac{(8/5) \cdot 5 + 1}{6},\ \frac{(17/5) \cdot 5 + 3}{6} \right\rangle = \langle 3/2,\ 10/3 \rangle,

d^2(6) = d^2(5) + (1 + 1) = 12 + 2 = 14,

D^2(6) = \frac{d^2(6)}{6 \cdot (6-1)} = 7/15,

d_{cc}^2 = \| \langle 1 - 8/5,\ 3 - 17/5 \rangle \|^2 = 2/5,

r(6) = \left( \frac{2/5}{7/15} \right)^{1/4} = (6/7)^{1/4} \approx 0.96,

a(6) = \tanh(0.96) \approx 0.75.
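The centroid step of this example can be checked mechanically with the incremental update sketched earlier after the proof of Lemma 3.1:

    import numpy as np

    def update_centroid(c_prev, LS_c, t):
        """Lemma 3.1: c(t) = c(t-1) * (1 - 1/t) + LS_c / t."""
        return c_prev * (1.0 - 1.0 / t) + LS_c / t

    # c(6) from c(5) = <8/5, 17/5> and the new point <1, 3>:
    c6 = update_centroid(np.array([8/5, 17/5]), np.array([1.0, 3.0]), t=6)
    print(c6)   # -> [1.5 3.3333...], i.e. <3/2, 10/3>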

We consider several evaluation metrics to compare the performance of our proposed model to the frequency based anomaly detection model: Detection Rate (also called true positive rate, recall, or hit rate in the literature), False Alarm Rate (or false positive rate) [120], Precision (or positive predictive value), and F1 score (also F-score or F-measure). Detection Rate is the ratio of correctly alarmed risks to the total number of actual risks. False Alarm Rate is the fraction of normal data points that are incorrectly alarmed. The F1 score is a measure of a test's accuracy (see the definitions in (3.11), (3.12), (3.13), and (3.14)).

Precision = \frac{TP}{TP + FP} \qquad (3.11)

True Positive Rate = \frac{TP}{TP + FN} \qquad (3.12)

False Alarm Rate = \frac{FP}{FP + TN} \qquad (3.13)

F1 = \frac{2\,TP}{2\,TP + FP + FN} \qquad (3.14)
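These four metrics are straightforward to compute from confusion matrix counts; below is a small helper of ours, with counts chosen to mirror the first row of Table 3.5:

    def prf(tp, fp, fn, tn):
        """Precision, true positive rate, false alarm rate, F1 (3.11-3.14)."""
        precision = tp / (tp + fp)
        tpr = tp / (tp + fn)
        far = fp / (fp + tn)
        f1 = 2 * tp / (2 * tp + fp + fn)
        return precision, tpr, far, f1

    # e.g. 1 detected attack and 4 false alarms among 1048 data points:
    print(prf(tp=1, fp=4, fn=0, tn=1043))   # -> (0.2, 1.0, 0.00382..., 0.333...)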

    3.4. Experiments and Analysis

This section briefly reports the results of experiments comparing the proposed model to the frequency based model. We demonstrate the learning capacity and the impact of parameters on time and memory utilization. The frequency based anomaly detection model was introduced in Section 3.2.

    3.4.1. Dataset

In 1998, 1999, and 2000, the MIT Lincoln Laboratory [57] conducted a comparative evaluation of intrusion detection systems (IDSs) developed under DARPA funding. This effort examined Internet traffic at Air Force bases, simulated on a test network. The idea was to generate a set of realistic attacks, embed them in normal data, and evaluate the false alarm and detection rates of systems on these data in order to improve the performance of existing IDSs [57]. We use the DARPA dataset as a test case for our proposed model.

In order to extract information from the tcpdump datasets of DARPA, the TcpTrace utility software [102] was used. This preprocessing procedure was applied to TCP connection records, ignoring ICMP and UDP packets. The feature list obtained from the raw tcpdump data using the TcpTrace software is presented in Table 3.2. The preprocessed dataset is structured into nine features, where each feature denotes the statistical count of network traffic within a fixed time interval.

    Table 3.2: The extracted features from raw tcpdump data using tcptrace software

    Extracted Relevant Features

    Name Description

    IIN The number of packets flowing from inside to inside network

    ION The number of packets flowing from inside to outside network

    IDN The number of packets flowing from inside to DMZ network

    OON The number of packets flowing from outside to outside network

    OIN The number of packets flowing from outside to inside network

    ODN The number of packets flowing from outside to DMZ network

    DDN The number of packets flowing from DMZ to DMZ network

    DIN The number of packets flowing from DMZ to inside network

    DON The number of packets flowing from DMZ to outside network

Preprocessed network traffic statistics are gathered every 10 seconds for investigation. The attack-free weeks (the 1st and 3rd weeks) of the DARPA 1999 dataset are used as training data, and the DARPA 2000 dataset, which contains DDoS attacks, is used as test data. We obtained 20270 rows from the first week and 21174 rows from the third week to create the normal dataset used for modeling. The DARPA 2000 dataset containing the attacks has 1048 rows. Figure 3.1 shows the DARPA 2000 data profile with attacks.
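The aggregation into 10-second count vectors is itself simple; the sketch below is not the actual TcpTrace pipeline, and the record format and zone codes are our own assumptions for illustration:

    from collections import Counter

    # Flow directions in the order of Table 3.2 (I = inside, O = outside, D = DMZ)
    FLOWS = [("I", "I"), ("I", "O"), ("I", "D"),
             ("O", "O"), ("O", "I"), ("O", "D"),
             ("D", "D"), ("D", "I"), ("D", "O")]

    def window_counts(packets, width=10.0):
        """Aggregate (time, src_zone, dst_zone) records into one
        9-feature count vector per fixed-width time window."""
        windows = {}
        for t, src, dst in packets:
            w = int(t // width)
            windows.setdefault(w, Counter())[(src, dst)] += 1
        return [[cnt[f] for f in FLOWS] for _, cnt in sorted(windows.items())]

    pkts = [(0.4, "I", "O"), (3.1, "I", "O"), (9.9, "O", "I"), (12.0, "D", "D")]
    print(window_counts(pkts))
    # -> [[0, 2, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 0]]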

    3.4.2. Experiments

Now we present the performance experiments that compare the two models using derivations from the confusion matrix. Table 3.3 gives the legend used in this section for quick reference. The experimental results show that the frequency based anomaly detection model detects the attack after training on the first week of data; however, the side effect is a high false alarm rate. Training with the third week additionally drops the false alarm rate by 5% and increases the detection rate by 5%. Tables 3.5 and 3.6 provide the detection and false alarm rates after training on the first week alone and after continuing training with the third week. The threshold used is 0.8 with Jaccard clustering.

    Figure 3.1: Logarithm of traffic volume shows the DDoS attacks

Table 3.4 shows the number of states created in the EMM using the first and third weeks of the DARPA 1999 normal dataset. As we can see, the numbers of EMM nodes or states differ only slightly. This demonstrates the learning capability of the EMM, although exhaustive learning is not possible. This observation is consistent with [76], which reported a sublinear growth rate in the number of EMM states. Also, compared with the size of the dataset, the number of EMM states is very low in all cases in the table, which implies the efficiency of the model. We can also see that different similarity measures with different threshold values yield different numbers of EMM nodes or states in the modeling process. Thus, the selection of threshold values impacts memory usage and time utilization.

    To conclude, the proposed risk leveling model lowers the false alarm rate compared


Table 3.3: Legend used in the performance evaluation with derivations from the confusion matrix.

Legend of performance experiments

Name   Description
NOA    Number of observable attacks
NA     Number of alerts
NTAD   Number of true attacks detected
P      Precision
TPR    True positive rate
FAR    False alarm rate
F1     F-measure

    with the frequency based anomaly detection model and keeps a high detection rate in the

    test cases. The approach is efficient, incremental and scalable.

    3.5. Chapter Summary

This chapter presents a novel data mining technique to detect traffic based network intrusions. Our proposed technique takes both frequency and data deviation into account in an efficient, incremental, and scalable anomaly detection model. The performance experiments support our assumption that traffic related network intrusions are accompanied by data deviation. The technique is suitable for online processing.

There are several directions for future research. These include the design of models incorporating signatures that were previously determined to be risks, investigation of the correlations of the parameters, and exploration of the feasibility of the model for dynamic datasets in grid computing environments.

Table 3.4: Impacts of clustering thresholds and selection of similarity measures (number of EMM states; DARPA 1999 normal dataset)

                                   Threshold
Similarity               0.7          0.80         0.90          0.99
First week
  Jaccard                148          298          855           7794
  Dice                   72           120          372           5033
  Cosine                 13           21           59            1298
  Overlap                6            10           11            38
Third week
  Jaccard                181          367          1124          11820
  Dice                   84           145          449           7222
  Cosine                 13           22           63            1702
  Overlap                6            10           11            42
Diff between first & third weeks
  Jaccard                33 (18.23%)  69 (18.8%)   269 (23.93%)  4026 (34.1%)
  Dice                   12 (14.3%)   25 (17.24%)  77 (17.15%)   2189 (30.74%)
  Cosine                 0%           1 (4.55%)    4 (6.35%)     404 (23.74%)
  Overlap                0%           0%           0%            4 (9.52%)

Table 3.5: Detection rate and false alarm rate using the frequency based anomaly detection model

Setting                  NOA  NA  NTAD  P     TPR  FAR       F1
First Week Dataset       1    5   1     0.2   1    0.00382   0.3333333
With Third Week Dataset  1    4   1     0.25  1    0.002865  0.4

Table 3.6: Detection rate and false alarm rate using the risk leveling anomaly detection model

Setting                  NOA  NA  NTAD  P  TPR  FAR  F1
First Week Dataset       1    1   1     1  1    0    1
With Third Week Dataset  1    1   1     1  1    0    1

Chapter 4

    A COMPARATIVE STUDY OF OUTLIER DETECTION ALGORITHMS

In the previous chapter, we studied a new anomaly detection model based on the Extensible Markov Model (EMM), a spatiotemporal model that can be used to detect outliers in data streams. In this chapter¹, we study EMM's outlier detection performance on different real-life datasets and test it against two spatial outlier detection models.

¹This work has been published in the International Conference on Machine Learning and Data Mining, MLDM, 2009 [26].

    4.1. Introduction

Data Mining is the process of extracting interesting information from large sets of data. Outliers are defined as events that occur very infrequently. Detecting outliers before they escalate with potentially catastrophic consequences is very important for various real-life applications such as fraud detection, network robustness analysis, and intrusion detection. This chapter presents a comprehensive analysis of three outlier detection methods, i.e., the Extensible Markov Model (EMM), the Local Outlier Factor (LOF), and LSC-Mine. In the algorithm analysis section we present the time complexity analysis and outlier detection accuracy. The experiments conducted with the Ozone Level Detection, IR video trajectory, and 1999 and 2000 DARPA DDoS datasets indicate that EMM outperforms both LOF and LSC-Mine in both time and outlier detection accuracy. Recently, outlier detection has gained an enormous amount of attention and has become one of the most important problems in many industrial and financial applications. Supervised and unsupervised learning techniques are the two fundamental approaches to the problem of outlier detection. Supervised

learning approaches build models of normal data and detect deviations from the normal

model in observed data. The advantage of these types of outlier detection algorithms is that they can detect new types of activity as deviations from normal usage. In contrast, unsupervised outlier detection techniques identify outliers without using any prior knowledge of the data. It is essential for outlier detection techniques to detect sudden or unexpected changes in existing behavior as soon as possible. Consider, for example, the following three scenarios:

1. A network alarm is raised indicating a possible attack. The associated network traffic deviates from normal network traffic. The security analyst discovers that the enormous traffic is produced not from the Internet but from the local area network (LAN). This scenario is characterized as the zombie effect in a Distributed Denial of Service (DDoS) attack [120], where the LAN is utilized in the DDoS attack to deny service to a targeted network. It also means that the LAN was compromised long before the discovery of the DDoS attack. Computer systems in a LAN provide services that correspond to certain types of behavior; if a new service is started without the system administrator's permission, it is extremely important to raise an alarm and discover the suspicious activity as soon as possible in order to avoid disaster.

2. Video surveillance [121] is frequently encountered in commercial, residential, or military buildings. Finding outliers in the video data involves mining massive, automatically collected surveillance video databases to retrieve the shots containing independently moving targets. The environment in which such a system operates is often very noisy.

3. Today it is not news that the ozone layer is getting thinner and thinner [70]. This is harmful to human health and affects other important parts of our daily life, such as farming and tourism. Therefore, an accurate ozone alert forecasting system would facilitate issuing warnings to the public at an early stage, before the ozone reaches a dangerous level.

One recent approach to outlier detection, Local Outlier Factor (LOF) [78], is based

on the density of data close to an object. This algorithm has proven to perform well but suffers from some performance issues. In this chapter we compare the performance of LOF and one of its extensions, LSC-Mine [72], with that of our previously proposed modeling tool, the Extensible Markov Model (EMM) [120]. This comparative study examines the three outlier detection algorithms and reports their time and detection performance. The Extensible Markov Model (EMM) is a spatiotemporal modeling technique that interleaves a clustering algorithm with a first order Markov Chain (MC) [82], where at any point in time EMM can provide a high level summary of the data stream. The Local Outlier Factor (LOF) [78] is an unsupervised density-based algorithm that assigns to each object a degree of being an outlier. It is local in that the degree depends on how isolated the object is with respect to its surrounding neighborhood. LSC-Mine [72] was constructed to overcome the disadvantages of the earlier LOF technique.
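For readers who wish to experiment with LOF, scikit-learn ships an implementation; the minimal usage sketch below is ours and is not the code evaluated in this study:

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    X = np.array([[1.0, 4.0], [1.1, 3.9], [2.0, 3.0],
                  [2.1, 3.1], [9.0, 9.0]])       # the last point is isolated
    lof = LocalOutlierFactor(n_neighbors=2)
    labels = lof.fit_predict(X)                  # -1 marks detected outliers
    print(labels)                                # -> [ 1  1  1  1 -1]
    print(-lof.negative_outlier_factor_)         # LOF degrees; >> 1 is outlying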

    4.1.1. Extensible Markov Model

The Extensible Markov Model (EMM) [76] combines the advantages of distance-based clustering for spatial data with those of the Markov chain for temporality. As shown in our previous work [28], EMM achieves efficient modeling by mapping groups of closely located real world events to states of a Markov chain; EMM is thus an extension of the Markov chain. EMM uses clustering to obtain representative granules in the continuous data space. Also, by providing a dynamically adjustable structure, EMM is applicable to data stream processing when the number of states is unknown in advance, and it provides a heuristic modeling method for data that approximately satisfy the Markov property. The nodes in the graph are clusters of real world states, each of which is a vector of sensor values, for example from a flood level sensor in a river bend that continuously feeds values, creating a data stream. EMM defines a set of formalized procedures such that at any

time t, EMM consists of a Markov Chain (MC) [13] and algorithms to modify it, where algorithms include:

    algorithms include:

    1. EMMCluster defines a technique for matching between input data at time t + 1 and

    existing states in the MC at time t. This is a clustering algorithm which determines

    if the new data point or event should be added to an existing cluster (MC state) or

    whether a new cluster (MC state) should be created. A distance threshold th is used

in clustering. For more details, see Algorithm 2.

    2. EMMIncrement algorithm updates (as well as adds, deletes, and merges) MC at time

    t + 1 given the MC at time t and output of EMMCluster at time t + 1. For more

details, see Algorithm 3.

    3. EMMapplications are algorithms which use the EMM to solve various problems. To

    date we have examined EMM for prediction (EMMPredict) [76] and anomaly (rare,

    outlier event) detection (EMMRare) [120].

Throughout this chapter, EMM is viewed as a directed graph with nodes and links. Link and transition are used interchangeably to refer to a directed arc; node, state, and cluster are used interchangeably to refer to a vertex in the EMM. EMMCluster and EMMIncrement are used to model the data. The EMMapplications are used to perform applications based on the synopsis created in the modeling process. The synopsis includes information about the cluster features [107] and transitions between states. The cluster feature defined in [107] includes at least a count of occurrence, CNi (count on the node), and either a medoid or centroid for that cluster, LSi. To summarize, elements of the synopsis of an EMM are listed in Table 3.1. The frequency based anomaly detection application [76], one of the several applications of EMM, is compared with the LOF and LSC-Mine algorithms. The idea for outlier detection comes from the fact that the learning aspect of EMM dynamically creates a Markov chain and captures past behavior in the synopsis. No

input into the model identifies normal or abnormal behavior; instead, this is learned based

    on the statistics of occurrence of transitions and states within the generated Markov chain.

    By learning what is normal, the model can predict what is not. The basic idea is to define a

set of rules related to the cardinalities of clusters and transitions to judge outliers. An outlier is