Incremental Pattern Discovery on Streams, Graphs and Tensors. Jimeng Sun. Ph.D. Thesis Proposal, May 15, 2006

Page 1

Incremental Pattern Discovery on Streams, Graphs and Tensors

Jimeng Sun

Ph.D. Thesis Proposal

May 15, 2006

Page 2

Thesis Committee

Christos Faloutsos (Chair)
Tom Mitchell
Hui Zhang
David Steier, PricewaterhouseCoopers
Philip Yu, IBM Watson Research Center

Page 3

Thesis Proposal

Goal: incremental pattern discovery on streaming applications

Streams:
  E1: Environmental sensor networks
  E2: Cluster/data center monitoring

Graphs:
  E3: Social network analysis

Tensors:
  E4: Network forensics
  E5: Financial auditing
  E6: fMRI: Brain image analysis

How to summarize streaming data efficiently and incrementally?

Page 4

E1: Environmental Sensor Monitoring

Water distribution network, normal operation.

May have hundreds of measurements, and they are often related!

Figure: chlorine concentrations over Phase 1, Phase 2 and Phase 3, for sensors near the leak and sensors away from the leak.

CMU civil department, Prof. Jeanne M. VanBriesen

Page 5

E1: Environmental Sensor Monitoring

Water distribution network: normal operation, then a major leak.

May have hundreds of measurements, and they are often related!

Figure: chlorine concentrations over Phase 1, Phase 2 and Phase 3, for sensors near the leak and sensors away from the leak.

CMU civil department, Prof. Jeanne M. VanBriesen

Page 6

E1: Environmental Sensor Monitoring

We would like to discover a few "hidden (latent) variables" that summarize the key trends.

Figure: SPIRIT maps the actual measurements (n streams of chlorine concentrations, Phases 1 to 3) to k hidden variable(s), with k = 1-2.

Page 7

E3: Social network analysis

Traditionally, people focus on static networks and find community structures.

We plan to monitor the change of the community structure over time and identify abnormal individuals.

Figure: Authors-by-Keywords graphs from 1990 to 2004, with the DB and DM communities marked.

Page 8

E4: Network forensics

Directional network flows.

A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004]: 450 GB/hour with compression.

Task: identify abnormal traffic patterns and find out the cause.

Figure: source-by-destination traffic matrices for normal traffic and abnormal traffic.

Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie

Page 9

Commonality of all

Data:
  Continuously arriving
  Large volume
  Multi-dimensional
  Unlabeled

Task: incremental pattern discovery
  Main trends
  Anomalies

Page 10

Thesis statement

Incremental and efficient summarization of heterogeneous streaming data through a general and concise representation enables many real applications in different domains.

Page 11

Outline

Motivating examples
Data model and mining framework
Related work
Current work
Proposed work
Conclusion

Page 12

Static data model: Tensor

Formally, $\mathcal{X} \in \mathbb{R}^{N_1 \times \cdots \times N_M}$

Generalization of matrices

Represented as a multi-array, or data cube.

Order | 1st | 2nd | 3rd
Correspondence | Vector | Matrix | 3D array
Example | Sensors | Authors x Keywords | Sources x Destinations x Ports

Page 13

Dynamic data model (our focus)

Tensor streams: a sequence of Mth order tensors $\mathcal{X}_1, \ldots, \mathcal{X}_n$, where each $\mathcal{X}_i \in \mathbb{R}^{N_1 \times \cdots \times N_M}$ and n is increasing over time

Order | 1st | 2nd | 3rd
Correspondence | Multiple streams | Time evolving graphs | 3D arrays
Example | Sensors over time | Author x Keyword over time | Sources x Destinations x Ports over time

Page 14

Our framework for incremental pattern discovery

Figure (mining flow): Data Streams -> Preprocessing -> Tensor Streams -> Tensor Analysis -> Core tensors and Projections -> Application Modules (Anomaly Detection, Clustering, Prediction).

Page 15

Outline

Motivating examples
Data model and mining framework
Related work
Current work
Proposed work
Conclusion

Page 16

Related work

Low rank approximation
  PCA, SVD: orthogonal based projection
  CUR [Drineas05]: example based projection

Multilinear analysis
  Tensors: matricizing, mode-product
  Tensor decompositions: Tucker, PARAFAC, HOSVD

Stream mining
  Scan data once to identify patterns
  Sampling: [Vitter85], [Gibbons98]
  Sketches: [Indyk00], [Cormode03]

Graph mining
  Explorative: [Faloutsos04] [Kumar99] [Leskovec05] …
  Algorithmic: [Yan05] [Cormode05] …

Our Work

Page 17

Background: Singular value decomposition (SVD)

SVD: $A \approx U \Sigma V^T$, where $A \in \mathbb{R}^{m \times n}$, $U \in \mathbb{R}^{m \times k}$, $\Sigma \in \mathbb{R}^{k \times k}$ and $V \in \mathbb{R}^{n \times k}$

Best rank k approximation in L2
PCA is an important application of SVD
Note that U and V are dense and may have negative entries
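The "best rank k approximation" property can be checked numerically. The following is a minimal NumPy sketch, not part of the original slides; the helper name `best_rank_k` and the random test matrix are illustrative assumptions. It truncates the SVD to k terms and compares the spectral-norm error with the (k+1)-th singular value.

```python
import numpy as np

def best_rank_k(A, k):
    """Return the rank-k truncated SVD of A and all singular values (hypothetical helper)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return A_k, s

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))      # arbitrary dense test matrix
k = 2
A_k, s = best_rank_k(A, k)

# In the spectral (L2) norm, the error of the truncated SVD
# equals the first discarded singular value.
spectral_error = np.linalg.norm(A - A_k, 2)
print(spectral_error, s[k])
```

Note that `A_k` is dense even when `A` is sparse, which is the limitation the later CMD slides address.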

Page 18

Background: Latent semantic indexing (LSI)

Singular vectors are useful for clustering

$$
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
2 & 2 & 2 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
5 & 5 & 5 & 0 & 0 \\
0 & 0 & 0 & 2 & 2 \\
0 & 0 & 0 & 3 & 3 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}
=
\begin{bmatrix}
0.18 & 0 \\
0.36 & 0 \\
0.18 & 0 \\
0.90 & 0 \\
0 & 0.53 \\
0 & 0.80 \\
0 & 0.27
\end{bmatrix}
\times
\begin{bmatrix}
9.64 & 0 \\
0 & 5.29
\end{bmatrix}
\times
\begin{bmatrix}
0.58 & 0.58 & 0.58 & 0 & 0 \\
0 & 0 & 0 & 0.71 & 0.71
\end{bmatrix}
$$

The three factors are the document-concept, concept-association and concept-term matrices.

Terms: pattern, cluster, frequent, query, cache. Concepts: DM, DB.
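The numbers in this example can be reproduced with a few lines of NumPy. This is an illustrative sketch rather than code from the talk; the variable names are assumptions. It recomputes the two nonzero singular values and assigns each document (row) to the concept with the largest absolute loading.

```python
import numpy as np

# Document-term matrix from the slide (7 documents, 5 terms).
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Only the first two singular values are nonzero: two concepts.
print(np.round(s[:2], 2))

# Singular vectors are defined up to sign, so compare absolute loadings.
doc_concept = np.abs(U[:, :2])
clusters = doc_concept.argmax(axis=1)
print(clusters)
```

The first four documents fall into one concept and the last three into the other, which is the clustering the slide refers to.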

Page 19

Background: Tensor Operations

Matricizing: unfold a tensor into a matrix

Figure: a Source x Destination x Port tensor unfolded into a Source x (Destination * Port) matrix.
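Matricizing can be sketched in NumPy as an axis move followed by a reshape. This is one possible column ordering, chosen here for convenience; the slides do not fix an ordering, and the helper name `unfold` is an assumption.

```python
import numpy as np

def unfold(X, mode):
    """Matricize tensor X along `mode`: rows index that mode,
    columns enumerate all remaining modes (hypothetical helper)."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

# Toy tensor: 2 sources x 3 destinations x 4 ports.
X = np.arange(24).reshape(2, 3, 4)

X_source = unfold(X, 0)   # shape (2, 12): Source x (Destination * Port)
X_dest = unfold(X, 1)     # shape (3, 8):  Destination x (Source * Port)
print(X_source.shape, X_dest.shape)
```

Every entry of the tensor appears exactly once in each unfolding; only the layout changes.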

Page 20

Background: Tensor Operations

Mode-product: multiply a tensor with a matrix

Figure: a Source x Destination x Port tensor multiplied along the source mode by a "group" x source matrix, giving a "group" x Destination x Port tensor.
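As an illustration of the mode-product, the sketch below multiplies a toy tensor along its source mode by a grouping matrix, so that sources are aggregated into "groups". The helper `mode_product` and the toy data are assumptions made for this example.

```python
import numpy as np

def mode_product(X, U, mode):
    """Multiply tensor X by matrix U along `mode`.
    U has shape (J, X.shape[mode]); the result has size J on that mode."""
    out = np.tensordot(U, X, axes=(1, mode))   # contracted mode replaced by J, placed first
    return np.moveaxis(out, 0, mode)           # move it back to its original position

# Toy tensor: 4 sources x 3 destinations x 2 ports.
X = np.arange(24, dtype=float).reshape(4, 3, 2)

# Grouping matrix: sources 0,1 form group 0; sources 2,3 form group 1.
G = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)

Y = mode_product(X, G, 0)   # shape (2, 3, 2): "group" x destination x port
print(Y.shape)
```

Here each group slice is simply the sum of the slices of its member sources.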

Page 21

Outline

Data model
Framework
Related work
Current work
  Dynamic and Streaming tensor analysis (DTA/STA)
  Compact matrix decomposition (CMD)
Proposed work
Conclusion

Page 22

Methodology map

Order of data | static | dynamic
1st | | 1st order DTA, SPIRIT (1st order STA)
2nd | SVD, PCA, CMD | DTA, STA
>= 3 | PARAFAC, HOSVD, TensorPCA | DTA, STA

Page 23

Tensor analysis

Given a sequence of tensors $\mathcal{X}_1, \ldots, \mathcal{X}_n$, find the projection matrices $U_1, \ldots, U_M$ such that the reconstruction error e is minimized:

$$e = \sum_{t=1}^{n} \left\| \mathcal{X}_t - \mathcal{X}_t \prod_{i=1}^{M} \times_i \left( U_i U_i^T \right) \right\|_F^2$$

Note that this is a generalization of PCA when n is a constant

Figure: a Sources x Destinations x Ports tensor expressed as a Core Tensor together with a Source Projection, a Destination Projection and a Port Projection.
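A small sketch of the reconstruction error for a single tensor may help make the objective concrete. It assumes orthonormal projection matrices and uses hypothetical helper names; it is not the thesis implementation.

```python
import numpy as np

def mode_product(X, U, mode):
    """Multiply tensor X by matrix U (shape J x X.shape[mode]) along `mode`."""
    return np.moveaxis(np.tensordot(U, X, axes=(1, mode)), 0, mode)

def reconstruction_error(X, Us):
    """Squared Frobenius error after projecting X onto span(U_d) on every mode d."""
    X_hat = X
    for d, U in enumerate(Us):
        X_hat = mode_product(X_hat, U @ U.T, d)
    return np.linalg.norm(X - X_hat) ** 2

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 4, 2))   # toy source x destination x port tensor

# Full-rank (identity) projections lose nothing.
full = [np.eye(3), np.eye(4), np.eye(2)]
err_full = reconstruction_error(X, full)

# Keeping only the first source direction discards the other source slices.
reduced = [np.eye(3)[:, :1], np.eye(4), np.eye(2)]
err_reduced = reconstruction_error(X, reduced)
print(err_full, err_reduced)
```

Summing this quantity over all timestamps gives the error e that the projection matrices are chosen to minimize.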

Page 24

Why do we care?

Anomaly detection
  Reconstruction error driven
  Multiple resolution

Multiway latent semantic indexing (LSI)

Figure: Authors x Keywords tensors over time (1990 to 2004) with the DB and DM groups, projected by U_A and U_K; example labels are the authors Philip Yu and Michael Stonebraker and the keywords Query and Pattern.

Page 25

1st order DTA: problem

Given $x_1, \ldots, x_n$ where each $x_i \in \mathbb{R}^N$, find $U \in \mathbb{R}^{N \times R}$ such that the error e is small:

$$e = \sum_{i=1}^{n} \left\| x_i - U U^T x_i \right\|^2$$

Note that Y = XU

Figure: the n x N data matrix X (time x Sensors, rows $x_1, \ldots, x_n$), the R x N matrix $U^T$ (for example indoor and outdoor sensor groups), and the n x R matrix Y.

Page 26

1st order DTA

Input: new data vector $x \in \mathbb{R}^N$, old variance matrix $C \in \mathbb{R}^{N \times N}$

Output: new projection matrix $U \in \mathbb{R}^{N \times R}$

Algorithm:
1. Update the variance matrix: $C_{new} = x^T x + C$ (x taken as a row vector)
2. Diagonalize: $U \Lambda U^T = C_{new}$
3. Determine the rank R and return U

Diagonalization has to be done for every new x!

Figure: the new row x appended over time to the old data X.
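The three steps above can be sketched as follows. This is a minimal illustration, not the author's code: the function name `dta_update` is hypothetical, and the rank R is simply passed in as a parameter because the slide does not say how it is determined.

```python
import numpy as np

def dta_update(C, x, R):
    """One 1st order DTA step: update the variance matrix with the new
    vector x, diagonalize it, and keep the top-R eigenvectors."""
    x = np.asarray(x, dtype=float).reshape(1, -1)   # treat x as a row vector
    C_new = C + x.T @ x                             # step 1: rank-one update
    eigvals, eigvecs = np.linalg.eigh(C_new)        # step 2: C_new is symmetric
    top = np.argsort(eigvals)[::-1][:R]             # step 3: R is given (assumption)
    return C_new, eigvecs[:, top]

# Toy stream: every reading lies along the direction (0.6, 0.8).
direction = np.array([0.6, 0.8])
C = np.zeros((2, 2))
for amplitude in [1.0, 2.0, -1.0, 3.0]:
    C, U = dta_update(C, amplitude * direction, R=1)

print(U[:, 0])   # equals +/- (0.6, 0.8); the sign is arbitrary
```

The eigendecomposition inside the loop is the per-vector cost the slide warns about.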

Page 27

1st order STA: SPIRIT

Adjust U smoothly when new data arrive, without diagonalization

For each new point x:
  Project onto the current line
  Estimate the error
  Rotate the line in the direction of the error and in proportion to its magnitude

For each new point x and for i = 1, …, k:
  $y_i := U_i^T x$ (projection onto $U_i$)
  $d_i \leftarrow d_i + y_i^2$ (energy, proportional to the i-th eigenvalue)
  $e_i := x - y_i U_i$ (error)
  $U_i \leftarrow U_i + (1/d_i)\, y_i e_i$ (update estimate)
  $x \leftarrow x - y_i U_i$ (repeat with remainder)

Figure: data points on the Sensor 1 and Sensor 2 axes, with the current direction U and the error of a new point.
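The update rules above translate almost line by line into code. The sketch below is an illustration under stated assumptions: the function name `spirit_update`, the initial values of U and d, and the optional forgetting factor `lam` are not taken from the slide (with `lam=1.0` the energy update is exactly the one shown).

```python
import numpy as np

def spirit_update(U, d, x, lam=1.0):
    """One SPIRIT-style update of the k columns of U with a new point x.
    `lam` is a hypothetical forgetting factor; 1.0 reproduces the slide."""
    U, d = U.copy(), d.copy()
    x = np.asarray(x, dtype=float).copy()
    for i in range(U.shape[1]):
        y = U[:, i] @ x                     # projection onto U_i
        d[i] = lam * d[i] + y ** 2          # energy estimate
        e = x - y * U[:, i]                 # error
        U[:, i] = U[:, i] + (y / d[i]) * e  # rotate U_i toward the error
        x = x - y * U[:, i]                 # continue with the remainder
    return U, d

# Toy stream on two sensors: all points lie along (0.6, 0.8).
target = np.array([0.6, 0.8])
rng = np.random.default_rng(0)
U = np.array([[1.0], [0.0]])   # arbitrary starting direction (assumption)
d = np.array([1e-3])           # small positive initial energy (assumption)
for a in rng.normal(size=200):
    U, d = spirit_update(U, d, a * target)

cosine = abs(U[:, 0] @ target) / np.linalg.norm(U[:, 0])
print(cosine)   # close to 1: U has rotated onto the data direction
```

No eigendecomposition is needed; each point costs a handful of vector operations per hidden variable.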

Page 28

Mth order DTA

For each mode d of the new tensor:
1. Reconstruct the variance matrix: $C_d = U_d S_d U_d^T$
2. Matricize (and transpose) the new tensor along mode d to get $X_{(d)}$, and construct the variance matrix of the incremental tensor: $X_{(d)}^T X_{(d)}$
3. Update the variance matrix: $C_d \leftarrow C_d + X_{(d)}^T X_{(d)}$
4. Diagonalize the variance matrix: $C_d = U_d S_d U_d^T$, which gives the new $U_d$ and $S_d$
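One DTA step over all modes might look like the sketch below. It is an illustration only: the names `unfold` and `dta_step` are hypothetical, the ranks are passed in rather than determined automatically, and the variance matrices are stored directly instead of being reconstructed from $U_d$ and $S_d$. The unfolding used here puts mode d on the rows, so `Xd @ Xd.T` plays the role of $X_{(d)}^T X_{(d)}$ on the slide.

```python
import numpy as np

def unfold(X, mode):
    """Matricize X with `mode` on the rows and all other modes on the columns."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def dta_step(X, C, ranks):
    """Update every mode's variance matrix with the new tensor X and
    return the updated matrices plus the top-R_d eigenvectors per mode."""
    C_new, U = [], []
    for d in range(X.ndim):
        Xd = unfold(X, d)
        Cd = C[d] + Xd @ Xd.T                 # update variance matrix of mode d
        w, V = np.linalg.eigh(Cd)             # diagonalize (symmetric matrix)
        top = np.argsort(w)[::-1][:ranks[d]]
        C_new.append(Cd)
        U.append(V[:, top])
    return C_new, U

# Toy rank-one tensor: outer product of three vectors.
a = np.array([0.6, 0.8])
b = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 1.0])
X = np.einsum('i,j,k->ijk', a, b, c)

C = [np.zeros((n, n)) for n in X.shape]
C, U = dta_step(X, C, ranks=[1, 1, 1])
print(U[0][:, 0])   # +/- (0.6, 0.8)
```

The core tensor would then follow by multiplying X with each $U_d^T$ along mode d.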

Page 29

Mth order DTA: complexity

Storage: $O(\prod_i N_i)$, i.e., the size of an input tensor at a single timestamp

Computation: $\sum_i N_i^3$ (or $\sum_i N_i^2$) for the diagonalization of C, plus $\sum_i N_i \prod_j N_j$ for the matrix multiplication $X_{(d)}^T X_{(d)}$

For low order tensors (< 3), diagonalization is the main cost

For high order tensors, matrix multiplication is the main cost

Page 30

Streaming tensor analysis (STA)

Run SPIRIT along each mode (after matricizing the new tensor)

Complexity:
  Storage: $O(\prod_i N_i)$
  Computation: $\sum_i R_i \prod_j N_j$, which is smaller than DTA

Figure: a SPIRIT step on one matricized column x, with projection $y_1$, error $e_1$, and $U_1$ before and after the update.

Page 31

Experiment

Goal:
  Computation efficiency
  Accurate approximation
  Real applications
    Anomaly detection
    Clustering

Page 32

Data set 1: Network data

TCP flows collected at the CMU backbone
Raw data: 500 GB with compression
Construct 2nd or 3rd order tensors with hourly windows, with <source, destination, value> or <source, destination, port, value>
Each tensor: 500 x 500 or 500 x 500 x 100, biased sampled from over 22k hosts
1200 timestamps (hours)

Figure: sparse data with a power-law distribution, 10AM to 11AM on 01/06/2005.

Page 33

Data set 2: Bibliographic data (DBLP)

Papers from VLDB and KDD conferences
Construct 2nd order tensors with yearly windows, with <author, keywords, num>
Each tensor: 4584 x 3741
11 timestamps (years)

Page 34

Computational cost

Panels: 3rd order network tensor; 2nd order DBLP tensor

OTA is the offline tensor analysis
Performance metric: CPU time (sec)
Observations:
  DTA and STA are orders of magnitude faster than OTA
  The slight upward trend in DBLP is due to the increasing number of papers each year (data become denser over time)

Page 35

Accuracy comparison

Panels: 3rd order network tensor; 2nd order DBLP tensor

Performance metric: the ratio of reconstruction error between DTA/STA and OTA, fixing the error of OTA to 20%
Observation: DTA performs very close to OTA in both datasets; STA performs worse in DBLP due to the bigger changes.

Page 36

Network anomaly detection

Reconstruction error gives an indication of anomalies.
The prominent difference between normal and abnormal traffic is mainly due to unusual scanning activity (confirmed by the campus admin).

Figure: reconstruction error over time, with examples of normal traffic and abnormal traffic.

Page 37

Multiway LSI

Group | Authors | Keywords | Year
DB | michael carey, michael stonebraker, h. jagadish, hector garcia-molina | queri, parallel, optimization, concurr, objectorient | 1995
DB | surajit chaudhuri, mitch cherniack, michael stonebraker, ugur cetintemel | distribut, systems, view, storage, servic, process, cache | 2004
DM | jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal | streams, pattern, support, cluster, index, gener, queri | 2004

Two groups are correctly identified: Databases and Data mining
People and concepts are drifting over time

Page 38

Quick summary of DTA/STA

Tensor stream is a general data model
DTA/STA incrementally decompose tensors into core tensors and projection matrices
The result of DTA/STA can be used in other applications
  Anomaly detection
  Multiway LSI

Incremental computation!

Page 39

Outline

Data model
Framework
Related work
Current work
  Dynamic and Streaming tensor analysis (DTA/STA)
  Compact matrix decomposition (CMD)
Proposed work
Conclusion

Page 40

Methodology map

Order of data | static | dynamic
1st | | 1st order DTA, SPIRIT (1st order STA)
2nd | SVD, PCA, CMD | DTA, STA
>= 3 | PARAFAC, HOSVD, TensorPCA | DTA, STA

Page 41

Disadvantage of orthogonal projection on sparse data

Real data are often (very) sparse

Orthogonal projection does not preserve the sparsity in the data
  More space than the original data
  Large computational cost

Data | Size | Nonzero percent
Network flow | 22k-by-22k | 0.0025%
DBLP (author, conference) | 428k-by-3.6k | 0.004%

Page 42

Interpretability problem of orthogonal projection

Each column of the projection matrix $U_i$ is a linear combination of all dimensions along a certain mode, e.g. $U_i(:,1) = [0.5; -0.5; 0.5; 0.5]$

All the data are projected onto the span of $U_i$

It is hard to interpret the projections

Page 43

Compact matrix decomposition (CMD)

Example-based projection: use actual rows and columns to specify the subspace

Given a matrix $A \in \mathbb{R}^{m \times n}$, find three matrices $C \in \mathbb{R}^{m \times c}$, $U \in \mathbb{R}^{c \times r}$, $R \in \mathbb{R}^{r \times n}$, such that $\|A - CUR\|$ is small

U is the pseudo-inverse of X

Figure: A (m x n) approximated from sampled columns C (m x c) and sampled rows R (r x n), with X marking their overlap; orthogonal projection is contrasted with example-based projection.
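To make the example-based projection concrete, here is a minimal sketch of the C, U, R construction for hand-picked column and row indices. It leaves out the biased sampling and duplicate handling of CMD; the helper name `cur_from_indices` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def cur_from_indices(A, cols, rows):
    """Build C, U, R from actual columns and rows of A.
    U is the pseudo-inverse of X, the block where C and R overlap."""
    C = A[:, cols]
    R = A[rows, :]
    X = A[np.ix_(rows, cols)]
    U = np.linalg.pinv(X)
    return C, U, R

# Toy rank-2 matrix (4 x 5), built as a product of two thin factors.
F = np.array([[1, 0], [0, 1], [1, 1], [2, 1]], dtype=float)
G = np.array([[1, 0, 1, 2, 0], [0, 1, 1, 0, 3]], dtype=float)
A = F @ G

C, U, R = cur_from_indices(A, cols=[0, 1], rows=[0, 1])
error = np.linalg.norm(A - C @ U @ R)
print(error)   # ~0 here, because the chosen columns and rows span A
```

C and R are actual columns and rows of A, so they stay sparse when A is sparse and can be read directly as representative examples.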

Page 44

CMD algorithm (high level)

CMU from 4K feet

Page 45

CMD algorithm (high level)

1. Biased sampling, with replacement, of columns and rows from A
2. Remove duplicates with proper scaling
3. Construct U from C and R (pseudo-inverse of the intersection of C and R)

Figure: a small 0/1 example matrix A; the sampled columns Cd and rows Rd, which contain duplicates; C and R after duplicate removal and scaling; and U.

Page 46

CMD algorithm (low level)

CMU from 3 feet

Page 47

CMD algorithm (low level)

Remove duplicates with proper scaling:

  $C_i = u_i^{1/2}\, C_i$
  $R_i = v_i\, R_i$

where $u_i$ and $v_i$ are the number of occurrences of $C_i$ and $R_i$.

Figure: Cd (m x c) and Rd (r x n) with duplicates are reduced to C (m x c') and R (r' x n), with X marking the overlap.

Theorem: Matrices C and Cd have the same singular values and left singular vectors
Proof: see [Sun06]
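The theorem can be checked numerically on a toy example. The sketch below is illustrative only (the helper `dedup_scale_columns` and the matrix are assumptions): it collapses duplicate columns, scales each kept column by the square root of its count, and compares the result with the duplicated matrix.

```python
import numpy as np

def dedup_scale_columns(Cd):
    """Keep one copy of each distinct column of Cd, scaled by the
    square root of the number of times it occurs."""
    cols, counts = np.unique(Cd, axis=1, return_counts=True)
    return cols * np.sqrt(counts)

# Hypothetical toy matrix and a sample in which column 2 is drawn three times.
A = np.array([[1, 1, 1, 1],
              [1, 0, 1, 0],
              [0, 0, 1, 1]], dtype=float)
Cd = A[:, [2, 0, 2, 2]]
C = dedup_scale_columns(Cd)

s_dup = np.linalg.svd(Cd, compute_uv=False)
s_cmp = np.linalg.svd(C, compute_uv=False)
print(s_dup, s_cmp)

# Cd Cd^T equals C C^T, which is why the singular values and
# left singular vectors coincide.
same_gram = np.allclose(Cd @ Cd.T, C @ C.T)
```

The deduplicated matrix has fewer columns, so any later SVD on it is cheaper and it takes less space.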

Page 48

Experiment

Datasets:

Data | Dimension | Nonzeros
Network flow (source, destination) | 22k-by-22k | 12K
DBLP (author, conference) | 428K-by-3.6K | 64K

Performance metrics:
  Space ratio to the original data
  CPU time (sec)
  Accuracy = 1 - reconstruction error

Page 49

Space efficiency

Panels: Network; DBLP

CMD uses much smaller space to achieve the same accuracy
CUR limitation: duplicate columns and rows
SVD limitation: orthogonal projection densifies the data

Page 50

Computational efficiency

Panels: Network; DBLP

CMD is the fastest among all three
CMD and CUR require SVD on only the sampled columns
CUR is much worse than CMD due to duplicate columns
SVD is the slowest since it operates on the entire data

Page 51

Quick summary on CMD

CMD: $A \approx C\,U\,R$
  C/R: sampled and scaled columns and rows (sparse)
  U: a small matrix (dense)

Properties
  Interpretability: interpret the matrix by sampled rows and columns
  Efficiency: in computation and space

Application
  Anomaly detection

Efficient computation, intuitive model

Page 52

My related publications

Sun, J., Tao, D., Faloutsos, C. Beyond Streams and Graphs: Dynamic Tensor Analysis, submitted.
Sun, J., Xie, Y., Zhang, H., Faloutsos, C. Compact Matrix Decomposition for Large Graphs: Theory and Practice, submitted.
Hoke, E., Sun, J., Faloutsos, C. InteMon: Intelligent Monitoring System for Large Clusters, submitted.
Sun, J., Papadimitriou, S., Faloutsos, C. Distributed Pattern Discovery in Multiple Streams, PAKDD 2006.
Papadimitriou, S., Sun, J., Faloutsos, C. Streaming Pattern Discovery in Multiple Time-Series, VLDB 2005.
Sun, J., Papadimitriou, S., Faloutsos, C. Online Latent Variable Detection in Sensor Networks, ICDE 2005.

Page 53

Outline

Motivating examples
Data model and mining framework
Background and related work
Current work
  Dynamic and Streaming tensor analysis (DTA/STA)
  Compact matrix decomposition (CMD)
Proposed work
Conclusion

Page 54

Proposed work

Methodology

Figure: map of tensor analysis methods by order (1st, 2nd, Mth) and by approach (orthogonal projection, example-based projection, other divergence). Entries shown: SPIRIT and Distributed SPIRIT (1st order); CMD and [P1, P2] (2nd order); DTA and STA (Mth order); [P3]; [P4].

Evaluation

Goal: real data, real application, real patterns

Page 55

P1: Effective example-based projection

Occasionally, CMD does not give an accurate result, especially when the "large" columns and rows are nearly parallel.
The current heuristic keeps sampling those "large" columns/rows.

Recent work [Drineas06] provides a relative error guarantee:

  $\|A - CUR\| \le (1 + \epsilon)\, \|A - A_k\|$, where $A_k$ is the best rank-k approximation from SVD

Our idea: pick the column that disagrees the most with the selected columns.

Figure: the example matrix with rows 1111, 1111, 0011, comparing the columns chosen by CMD with those chosen by the new method.

Page 56

P2: Incremental CMD

Given time evolving graphs (2nd order tensor stream), we currently need to apply CMD at every timestamp on a new (slightly changed) graph.
How to compute CMD efficiently over time?
Our idea:

Figure: the example graph at two timestamps.

  t = 1:
  1 1 1 1
  1 1 1 1
  0 0 1 1

  t = 2:
  1 2 2 1
  2 1 2 0
  0 0 1 2

Page 57

P3: Example-based tensor decomposition

CMD is currently on matrices (2nd order tensors) only.
Generalize CMD to higher order
  Build infrastructure: sparse tensor package [Kolda 06]
  Prototype sparse tensor access methods
    How to store a sparse tensor?
    How to access some subset of a tensor?

Our goal: implement tensor CMD efficiently.

Page 58

P4: Other divergence

Currently, the model implicitly assumes a Gaussian distribution and Euclidean distance.
But many real data are not Gaussian.
Our goal: focus on other distributions and distance measures.

Distance measure | Distribution
Euclidean distance | Gaussian distribution
KL divergence | Multinomial distribution
Bregman divergence | Exponential family
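The relation between these distance measures can be illustrated with the generic Bregman divergence $D_\phi(p, q) = \phi(p) - \phi(q) - \langle \nabla\phi(q),\, p - q \rangle$. The sketch below is an illustration with assumed names and toy vectors, not part of the proposal: choosing $\phi$ as the squared norm recovers squared Euclidean distance, and choosing negative entropy recovers KL divergence for probability vectors.

```python
import numpy as np

def bregman(phi, grad_phi, p, q):
    """Bregman divergence generated by the convex function phi."""
    return phi(p) - phi(q) - grad_phi(q) @ (p - q)

# Generator 1: squared Euclidean norm.
sq_norm = lambda x: x @ x
sq_norm_grad = lambda x: 2.0 * x

# Generator 2: negative entropy (entries must be positive).
neg_entropy = lambda x: np.sum(x * np.log(x))
neg_entropy_grad = lambda x: np.log(x) + 1.0

# Two toy probability vectors (each sums to 1).
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

d_euclid = bregman(sq_norm, sq_norm_grad, p, q)
d_kl = bregman(neg_entropy, neg_entropy_grad, p, q)
print(d_euclid, d_kl)
```

Swapping the generator function changes the distance measure without changing the surrounding code, which is the kind of generality this proposed direction aims for.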

Page 59

Evaluation plan

Real data, real application, real success or failure

Data | Tasks | Order
Environmental monitoring | Temperature, humidity in a large building; chlorine concentration in water distribution; real-time summarization and anomaly detection | 1st
Machine monitoring | Monitor a number of system parameters; identify unusual patterns in real time | 1st
DBLP/IMDB | Time evolving graphs; find community structure | 2nd
Network flow | Identify interesting patterns, identify attacks | 2nd or 3rd

Other data

Data | Tasks | Order
fMRI data | Brain image data; classification | 3rd
Financial data | Transaction data; identify the anomalies that may indicate fraud or errors | >= 1

Page 60

Timeline

1-3 months: P1: Effective example-based projection
4-6 months: P2: Incremental CMD
7-8 months: P3: Example-based tensor decomposition
9-11 months: P4: Other divergence
7-12 months: Writing thesis
After 12 months: Defense