3/11/10, BYU1 The Magnificent EMM Margaret H. Dunham Michael Hahsler, Mallik Kotamarti, Charlie...

3/11/10, BYU 1 The Magnificent EMM Margaret H. Dunham Michael Hahsler, Mallik Kotamarti, Charlie Isaksson CSE Department Southern Methodist University Dallas, Texas 75275 lyle.smu.edu/~mhd [email protected] This material is based upon work supported by the National Science Foundation under Grant No IIS-0948893.


3/11/10, BYU 1

The Magnificent EMM

Margaret H. Dunham

Michael Hahsler, Mallik Kotamarti, Charlie Isaksson

CSE Department

Southern Methodist University

Dallas, Texas 75275

lyle.smu.edu/~mhd

[email protected]

This material is based upon work supported by the National Science Foundation under Grant No IIS-0948893.


Objectives/Outline

EMM Overview
EMM + Stream Clustering
EMM + Bioinformatics


Objectives/Outline

EMM Overview
  Why
  What
  How
EMM + Stream Clustering
EMM + Bioinformatics


Lots of Questions

Why don’t data miners practice what they preach?

Why is training usually viewed as a one time thing?

Why do we usually ignore the temporal aspect of data streams?


Continuous Learning

Interleave learning & application

Add time to online clustering


MM

A first-order Markov chain is a finite or countably infinite sequence of events $\{E_1, E_2, \ldots\}$ over discrete time points, where $P_{ij} = P(E_j \mid E_i)$ and, at any time, the future behavior of the process is based solely on the current state.

A Markov Model (MM) is a graph with $m$ vertices or states, $S$, and directed arcs, $A$, such that:

$S = \{N_1, N_2, \ldots, N_m\}$, and

$A = \{L_{ij} \mid i \in \{1, 2, \ldots, m\},\ j \in \{1, 2, \ldots, m\}\}$, and

each arc $L_{ij} = \langle N_i, N_j \rangle$ is labeled with a transition probability $P_{ij} = P(N_j \mid N_i)$.
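As a rough illustration of the definition above, the transition probabilities of a first-order Markov model can be estimated by counting consecutive pairs in an observed state sequence. The function name `estimate_transitions` and the sample sequence are hypothetical, chosen only for this sketch.

```python
from collections import Counter, defaultdict


def estimate_transitions(sequence):
    """Estimate P_ij = P(N_j | N_i) from an observed sequence of state labels."""
    counts = defaultdict(Counter)
    # Count each observed arc <N_i, N_j> between consecutive states.
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    probs = {}
    for state, outgoing in counts.items():
        total = sum(outgoing.values())
        # Normalize the outgoing counts of each state so they sum to 1.
        probs[state] = {nxt: c / total for nxt, c in outgoing.items()}
    return probs


# Hypothetical sequence of visited states.
sequence = ["N1", "N2", "N1", "N3", "N1", "N2"]
P = estimate_transitions(sequence)
```

Here `P["N1"]` holds the probabilities of leaving N1, estimated from the three transitions that start in N1.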


Problem with Markov Chains

The required structure of the MC may not be certain at the model construction time.

As the real world being modeled by the MC changes, so should the structure of the MC.

Not scalable: grows linearly with the number of events.

Our solution:
  Extensible Markov Model (EMM)
  Cluster real world events
  Allow Markov chain to grow and shrink dynamically


EMM (Extensible Markov Model)

Time Varying Discrete First Order Markov Model

Continuously evolves
Nodes are clusters of real world states.
Learning continues during prediction phase.
Learning:
  Transition probabilities between nodes
  Node labels (centroid of cluster)
  Nodes are added and removed as data arrives


EMM Definition

Extensible Markov Model (EMM): at any time t, EMM consists of an MC with designated current node, Nn, and algorithms to modify it, where algorithms include:

EMMCluster, which defines a technique for matching between input data at time t + 1 and existing states in the MC at time t.

EMMIncrement algorithm, which updates MC at time t + 1 given the MC at time t and clustering measure result at time t + 1.

EMMDecrement algorithm, which removes nodes from the EMM when needed.


EMM Cluster

Nearest Neighbor
If none “close” create new node
Labeling of cluster is centroid of members in cluster
O(n)
  Here n is the number of states
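The nearest-neighbor step can be sketched as follows. This is a minimal illustration, not the authors' implementation: Euclidean distance, the `threshold` parameter that defines "close", and the names `Cluster` and `emm_cluster` are all assumptions made for the example.

```python
import math


class Cluster:
    """A cluster labeled by the centroid of its members."""

    def __init__(self, point):
        self.centroid = list(point)
        self.size = 1

    def absorb(self, point):
        # Incremental mean update keeps the label equal to the member centroid.
        self.size += 1
        self.centroid = [c + (x - c) / self.size
                         for c, x in zip(self.centroid, point)]


def emm_cluster(clusters, point, threshold):
    """Return the index of the cluster receiving `point`; create one if none is close."""
    best, best_dist = None, math.inf
    for i, cluster in enumerate(clusters):  # O(n) scan over the existing states
        d = math.dist(cluster.centroid, point)  # assumed distance measure
        if d < best_dist:
            best, best_dist = i, d
    if best is None or best_dist > threshold:
        clusters.append(Cluster(point))
        return len(clusters) - 1
    clusters[best].absorb(point)
    return best


# Hypothetical 2-D stream with an assumed threshold of 2.0.
clusters = []
assignments = [emm_cluster(clusters, p, 2.0)
               for p in [(0, 0), (1, 0), (10, 10), (0.5, 1)]]
```

The third point is far from the only existing centroid, so it opens a second cluster; the other points fall into the first.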


EMM Increment

[Figure: an EMM built step by step from the input vectors <18,10,3,3,1,0,0>, <17,10,2,3,1,0,0>, <16,9,2,3,1,0,0>, <14,8,2,3,1,0,0>, <14,8,2,3,0,0,0>, <18,10,3,3,1,1,0>. Nodes N1, N2, and N3 are added as the inputs arrive, and the transition probabilities on the arcs are updated at each step.]
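A minimal sketch of the increment step, assuming the clustering step has already mapped each input to a node label. The class name `EMM`, its fields, and the choice to normalize by the total outgoing link count of a node are assumptions of this example; the input labels are hypothetical and are not the node assignments shown on the slide.

```python
from collections import defaultdict


class EMM:
    """Keeps node and link counts plus a designated current node."""

    def __init__(self):
        self.node_counts = defaultdict(int)  # visits to each node
        self.link_counts = defaultdict(int)  # (from, to) -> observed transitions
        self.current = None

    def increment(self, node):
        """Record a move from the current node to `node`, then make it current."""
        if self.current is not None:
            self.link_counts[(self.current, node)] += 1
        self.node_counts[node] += 1
        self.current = node

    def transition_probability(self, src, dst):
        # Assumed normalization: share of the transitions leaving `src`.
        out = sum(c for (s, _), c in self.link_counts.items() if s == src)
        return self.link_counts.get((src, dst), 0) / out if out else 0.0


emm = EMM()
for label in ["N1", "N1", "N2", "N3", "N1", "N1"]:  # hypothetical cluster labels
    emm.increment(label)
```

Because only counts are stored, each new input costs a constant amount of work once its node is known.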


EMMDecrement

[Figure: an EMM with nodes N1, N2, N3, N5, and N6, shown before and after N2 is deleted, with the transition probabilities on the remaining arcs.]
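One simple way to picture the decrement step is to drop the node together with every arc that enters or leaves it. The sketch below does only that; how the probabilities on the surviving arcs are adjusted is not specified here, so no renormalization is attempted. The function name `emm_decrement` and all counts are hypothetical.

```python
def emm_decrement(node_counts, link_counts, node):
    """Return copies of the counts with `node` and all arcs touching it removed."""
    new_nodes = {n: c for n, c in node_counts.items() if n != node}
    new_links = {(s, d): c for (s, d), c in link_counts.items()
                 if node not in (s, d)}
    return new_nodes, new_links


# Hypothetical counts for a five-node model.
node_counts = {"N1": 3, "N2": 2, "N3": 3, "N5": 1, "N6": 1}
link_counts = {("N1", "N2"): 2, ("N2", "N3"): 2, ("N1", "N3"): 1,
               ("N3", "N5"): 1, ("N5", "N6"): 1}

nodes_after, links_after = emm_decrement(node_counts, link_counts, "N2")
```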


EMM Advantages

Dynamic
Adaptable
Use of clustering
Learns rare event
Scalable:
  Growth of EMM is not linear on size of data.
  Hierarchical feature of EMM
Creation/evaluation quasi-real time
Distributed / Hierarchical extensions


EMM Sublinear Growth

Servent Data


Growth Rate Automobile Traffic

Minnesota Traffic Data


EMM River Prediction

[Figure: Input Time Series. Water Level (m), on a scale of 0 to 8, plotted against time steps 1 to 659. Series shown: RLF Prediction, EMM Prediction, Observed.]


Determining Rare Event

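Occurrence Frequency ($OF_i$) of an EMM state $S_i$ is the normalized count of the state:

$$OF_i = \frac{n_i}{\sum_j n_j}$$

Normalized Transition Probability ($NTP_{m,n}$), from one state, $S_m$, to another, $S_n$, is a normalized transition count:

$$NTP_{m,n} = \frac{C_{m,n}}{\sum_j n_j}$$

Here $n_i$ is the count of state $S_i$ and $C_{m,n}$ is the count of transitions from $S_m$ to $S_n$.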
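The two measures can be computed directly from node and link counts. This is a sketch under the reading that both are normalized by the total state count; the function names and the sample counts are hypothetical, and the slide does not give a cutoff below which a state or transition counts as rare.

```python
def occurrence_frequency(node_counts, i):
    """OF_i: count of state i divided by the total count over all states."""
    total = sum(node_counts.values())
    return node_counts[i] / total


def normalized_transition_probability(node_counts, link_counts, m, n):
    """NTP_{m,n}: count of transitions m -> n divided by the total state count."""
    total = sum(node_counts.values())
    return link_counts.get((m, n), 0) / total


# Hypothetical counts: S3 is visited rarely, and only from S2.
node_counts = {"S1": 6, "S2": 3, "S3": 1}
link_counts = {("S1", "S2"): 3, ("S2", "S3"): 1}

of_s3 = occurrence_frequency(node_counts, "S3")
ntp_s2_s3 = normalized_transition_probability(node_counts, link_counts, "S2", "S3")
```

Low values of either measure mark candidates for rare events; a transition that was never observed gets a value of 0.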


EMM Rare Event Detection


Intrusion Data, Train DARPA 1999, Test DARPA 2000,

Ozone Data, UCI ML, Jaccard similarity,

2536 instances, 73 attributes, 73 ozone days


Objectives/Outline

EMM Overview

EMM + Stream Clustering
  Handle evolving clusters
  Incorporate time in clustering

EMM + Bioinformatics


Stream Data

A growing number of applications generate streams of data.
  Computer network monitoring data
  Call detail records in telecommunications
  Highway transportation traffic data
  Online web purchase log records
  Sensor network data
  Stock exchange, transactions in retail chains, ATM operations in banks, credit card transactions.
Clustering techniques play a key role in modeling and analyzing this data.


Stream Data Format

Events arriving in a stream
At any time, t, we can view the state of the problem as represented by a vector of n numeric values:

Vt = <S1t, S2t, ..., Snt>

      V1    V2    …    Vq
S1    S11   S12   …    S1q
S2    S21   S22   …    S2q
…     …     …     …    …
Sn    Sn1   Sn2   …    Snq

(Time runs along the columns, V1 to Vq.)


Traditional Clustering


TRAC-DS (Temporal Relationship Among Clusters for Data Streams)


Motivation

Temporal Ordering is a major feature of stream data.

Many stream applications depend on this ordering

  Prediction of future values
  Anomaly (rare event) detection
  Concept drift


Stream Clustering Requirements

Dynamic updating of the clusters
Completely online
Identify outliers
Identify concept drifts
Barbara [2]:
  compactness
  fast incremental processing


Data Stream Clustering

At each point in time a data stream clustering ζ is a partitioning of D', the data seen thus far.

Instead of the whole partitions C1, C2,..., Ck only synopses Cc1,Cc2,...,Cck are available and k is allowed to change over time.

The summaries Cci with i =1, 2,...,k typically contain information about the size, distribution and location of the data points in Ci.
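The slides do not say which synopsis is used. One common choice in stream clustering, shown here purely as an assumed example, is a summary that stores the number of points together with per-dimension linear and squared sums, from which size, location (centroid), and spread (variance) can be recovered without keeping the points. The class name `ClusterSummary` is hypothetical.

```python
class ClusterSummary:
    """Additive synopsis of one cluster: count, linear sum, and squared sum."""

    def __init__(self, dim):
        self.n = 0
        self.linear_sum = [0.0] * dim
        self.square_sum = [0.0] * dim

    def add(self, point):
        # Each arriving point updates the sums and is then discarded.
        self.n += 1
        for k, x in enumerate(point):
            self.linear_sum[k] += x
            self.square_sum[k] += x * x

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def variance(self):
        # Per-dimension population variance: E[x^2] - (E[x])^2.
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.linear_sum, self.square_sum)]


summary = ClusterSummary(dim=2)
for p in [(1.0, 2.0), (3.0, 4.0)]:
    summary.add(p)
```

Because every field is a running sum, two such summaries can be merged by adding their fields, which suits operations such as merging clusters.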


TRAC-DS NOTE

TRAC-DS is not:
  Another stream clustering algorithm
TRAC-DS is:
  A new way of looking at clustering
  Built on top of an existing clustering algorithm
TRAC-DS may be used with any stream clustering algorithm


TRAC-DS Overview


TRAC-DS Definition

Given a data stream clustering ζ, a temporal relationship among clusters (TRAC-DS) overlays ζ with an EMM M in such a way that the following are satisfied:

(1) There is a one-to-one correspondence between the clusters in ζ and the states S in M.

(2) A transition $a_{ij}$ in the EMM M represents the probability that, given a data point in cluster i, the next data point in the data stream will belong to cluster j, with $i, j = 1, 2, \ldots, k$.

(3) The EMM M is created online together with the data stream clustering.


Stream Clustering Operations *

q_assign_point(ζ, x): Assigns the new data point x to an existing cluster.
q_new_cluster(ζ, x): Creates a new cluster.
q_remove_cluster(ζ, x): Removes a cluster. Here x is the cluster, i, to be removed. In this case the associated summary Cci is removed from ζ and k is decremented by one.
q_merge_clusters(ζ, x): Merges two clusters.
q_fade_clusters(ζ, x): Fades the cluster structure.
q_split_clusters(ζ, x): Splits a cluster.

* Inspired by MONIC [13]


TRAC-DS Operations

r_assign_point(M, sc, y): Assigns the new data point to the state representing an existing cluster.
r_new_cluster(M, sc, y): Creates a state for a new cluster.
r_remove_cluster(M, sc, y): Removes a state.
r_merge_clusters(M, sc, y): Merges two states.
r_fade_clusters(M, sc, y): Fades the transition probabilities using an exponential decay $f(t) = 2^{-\lambda t}$.
r_split_clusters(M, sc, y): Splits states.

Y clustering operations.
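The fade operation can be sketched by scaling stored transition counts with the decay factor $2^{-\lambda t}$. Applying the factor to counts rather than to the probabilities themselves, the function name `fade`, and the parameter names `lam` and `dt` are assumptions made for this illustration.

```python
def fade(link_counts, lam, dt=1.0):
    """Scale every transition count by the decay factor 2 ** (-lam * dt)."""
    factor = 2.0 ** (-lam * dt)
    return {link: count * factor for link, count in link_counts.items()}


# Hypothetical counts; with lam = 1.0 each elapsed time unit halves the weights.
link_counts = {("A", "B"): 4.0, ("B", "A"): 1.0}
faded = fade(link_counts, lam=1.0, dt=1.0)
```

Since every count is scaled by the same factor, the relative weights of existing transitions are unchanged by a fade; newly observed transitions then count for more than old ones.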


TRAC-DS Example


Objectives/Outline

EMM Overview
EMM + Stream Clustering
EMM + Bioinformatics
  Background
  Preprocessing
  Classification
  Differentiation


DNA

Basic building blocks of organisms
Located in nucleus of cells
Composed of 4 nucleotides
Two strands bound together


http://www.visionlearning.com/library/module_viewer.php?mid=63


Central Dogma: DNA -> RNA -> Protein

[Figure: DNA (CCTGAGCCAACTATTGATGAA) is transcribed into RNA (CCUGAGCCAACUAUUGAUGAA), which is translated into protein (amino acids).]

www.bioalgorithms.info; chapter 6; Gene Prediction


RNA

Ribonucleic Acid
Contains A, C, G but U (Uracil) instead of T
Single stranded
May fold back on itself
Needed to create proteins
Move around cells: can act like a messenger
mRNA: moves out of nucleus to other parts of cell


The Magical 16S

Ribosomal RNA (rRNA) is at the heart of the protein creation process
16S rRNA
  About 1542 nucleotides in length
  In all living organisms
  Important in the classification of organisms into phyla and class
PROBLEM: An organism may actually contain many different copies of 16S, each slightly different.

OUR WORK: Can we use EMM to quantify this diversity? Can we use it to classify different species of the same genus?


Using EMM with RNA Data

acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga

Moving Window

             A   C   G   T
Pos 0-8      2   3   3   1
Pos 1-9      1   3   3   2
…
Pos 34-42    2   4   2   1

Construct EMM with nodes representing clusters of count vectors
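The moving-window preprocessing can be sketched as below. The window width of 9 is inferred from the position ranges on the slide (0-8, 1-9, ..., 34-42), and the function name `window_counts` is hypothetical. Each resulting count vector would then be passed to the clustering step as one input of the stream.

```python
from collections import Counter


def window_counts(sequence, width=9, alphabet="acgt"):
    """Slide a window over the sequence and count each nucleotide inside it."""
    sequence = sequence.lower()
    vectors = []
    for start in range(len(sequence) - width + 1):
        counts = Counter(sequence[start:start + width])
        # One count vector per window, ordered A, C, G, T.
        vectors.append([counts[base] for base in alphabet])
    return vectors


seq = "acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga"
vectors = window_counts(seq)
print(vectors[0], vectors[1], vectors[-1])
# prints [2, 3, 3, 1] [1, 3, 3, 2] [2, 4, 2, 1]
```

The 43-character sequence yields 35 overlapping windows, and the first, second, and last vectors match the rows shown in the table.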


EMM for Classification


TRAC-DS and Bioinformatics

Efficient
  Alignment free sequence analysis
  Clustering reduces size of model
Flexible
  Any sequence
  Applicability to Metagenomics
Scoring based on similarity between EMMs or EMM and input sequence
Applications
  Classification
  Differentiation


Profile EMMs for Organism Classification


Profile EMM – E Coli


Differentiating Strains

Is it possible to identify different species of same genus?

Initial test with EMM:

Bacillus has 21 species

Construct EMM for each species using training set (64%)

Test by matching unknown strains (36%) and place in closest EMM

All unknown strains correctly classified except one: accuracy of 95%


Bibliography

1) C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. Proceedings of the International Conference on Very Large Data Bases (VLDB), pp 81-92, 2003.

2) D. Barbara, “Requirements for clustering data streams,” SIGKDD Explorations, Vol 3, No 2, pp 23-27, 2002.

3) Margaret H. Dunham, Donya Quick, Yuhang Wang, Monnie McGee, Jim Waddle, “Visualization of DNA/RNA Structure using Temporal CGRs,” Proceedings of the IEEE 6th Symposium on Bioinformatics & Bioengineering (BIBE06), October 16-18, 2006, Washington D.C., pp 171-178.

4) S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams: Theory and practice,” IEEE Transactions on Knowledge and Data Engineering, Vol 15, No 3, pp 515-528, 2003.

5) Michael Hahsler and Margaret H. Dunham, “TRACDS: Temporal Relationship Among Clusters for Data Streams,” October 2009, submitted to SIAM International Conference on Data Mining.

6) Jie Huang, Yu Meng, and Margaret H. Dunham, “Extensible Markov Model,” Proceedings IEEE ICDM Conference, November 2004, pp 371-374.

7) Charlie Isaksson, Yu Meng, and Margaret H. Dunham, “Risk Leveling of Network Traffic Anomalies,” International Journal of Computer Science and Network Security, Vol 6, No 6, June 2006, pp 258-265.

8) Charlie Isaksson and Margaret H. Dunham, “A Comparative Study of Outlier Detection,” July 2009, Proceedings of the IEEE MLDM Conference, pp 440-453.

9) Mallik Kotamarti, Douglas W. Raiford, M. L. Raymer, and Margaret H. Dunham, “A Data Mining Approach to Predicting Phylum for Microbial Organisms Using Genome-Wide Sequence Data,” Proceedings of the IEEE Ninth International Conference on Bioinformatics and Bioengineering, pp 161-167, June 22-24 2009.

10) Yu Meng and Margaret H. Dunham, “Efficient Mining of Emerging Events in a Dynamic Spatiotemporal,” Proceedings of the IEEE PAKDD Conference, April 2006, Singapore. (Also in Lecture Notes in Computer Science, Vol 3918, 2006, Springer Berlin/Heidelberg, pp 750-754.)

11) Yu Meng and Margaret H. Dunham, “Mining Developing Trends of Dynamic Spatiotemporal Data Streams,” Journal of Computers, Vol 1, No 3, June 2006, pp 43-50.

12) MIT Lincoln Laboratory, DARPA Intrusion Detection Evaluation, http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/index.html, 2008.

13) M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and R. Schult. MONIC: Modeling and monitoring cluster transitions. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, Philadelphia, PA, USA, pages 706–711, 2006.
