Mining Dynamics of Data Streams in Multidimensional Space ...

Mining Dynamics of Data Streams in Multidimensional Space

Jiawei Han

Department of Computer Science

University of Illinois at Urbana-Champaign

www.cs.uiuc.edu/~hanj

04/13/23 Mining Dynamics in Data Streams 2

Outline

Characteristics of data streams

Mining dynamics in data streams

Multi-dimensional (regression) analysis of data

streams

Stream cubing and stream OLAP methods

Mining other kinds of patterns in data streams

Research problems


Characteristics of Data Streams

Data Streams Data streams—continuous, ordered, changing, fast, huge amount

Traditional DBMS—data stored in finite, persistent data setsdata sets

Characteristics Huge volumes of continuous data, possibly infinite Fast changing and requires fast, real-time response Data stream captures nicely our data processing needs of today Random access is expensive—single linear scan algorithm (can

only have one look) Store only the summary of the data seen thus far Most stream data are at pretty low-level or multi-dimensional in

nature, needs multi-level and multi-dimensional processing


Stream Data Applications

Telecommunication calling records Business: credit card transaction flows Network monitoring and traffic engineering Financial market: stock exchange Engineering & industrial processes: power supply &

manufacturing Sensor, monitoring & surveillance: video streams Security monitoring Web logs and Web page click streams Massive data sets (even saved but random access is too

expensive)


Architecture: Stream Query Processing

Scratch SpaceScratch Space(Main memory and/or Disk)(Main memory and/or Disk)

User/ApplicationUser/ApplicationUser/ApplicationUser/Application

Continuous QueryContinuous Query

Stream QueryStream QueryProcessorProcessor

ResultsResults

Multiple streamsMultiple streams

SDMS (Stream Data Management System)


Challenges of Stream Data Processing

Multiple, continuous, rapid, time-varying, ordered streams Main memory computations Queries are often continuous

Evaluated continuously as stream data arrives Answer updated over time

Queries are often complex Beyond element-at-a-time processing Beyond stream-at-a-time processing Beyond relational queries (scientific, data mining, OLAP)

Multi-level/multi-dimensional processing and data mining Most stream data are at pretty low-level or multi-dimensional in

nature


Processing Stream Queries

Query types One-time query vs. continuous query (being evaluated

continuously as stream continues to arrive) Predefined query vs. ad-hoc query (issued on-line)

Unbounded memory requirements For real-time response, main memory algorithm should be used Memory requirement is unbounded if one will join future tuples

Approximate query answering With bounded memory, it is not always possible to produce exact

answers High-quality approximate answers are desired Data reduction and synopsis construction methods

Sketches, random sampling, histograms, wavelets, etc.


Projects on DSMS (Data Stream Management System)

Research projects and system prototypes STREAMSTREAM (Stanford): A general-purpose DSMS CougarCougar (Cornell): sensors AuroraAurora (Brown/MIT): sensor monitoring, dataflow Hancock Hancock (AT&T): telecom streams NiagaraNiagara (OGI/Wisconsin): Internet XML databases OpenCQOpenCQ (Georgia Tech): triggers, incr. view maintenance TapestryTapestry (Xerox): pub/sub content-based filtering TelegraphTelegraph (Berkeley): adaptive engine for sensors TradebotTradebot (www.tradebot.com): stock tickers & streams TribecaTribeca (Bellcore): network monitoring Streaminer Streaminer (UIUC): new project for stream data mining


Stream Data Mining vs. Stream Querying

Stream mining—A more challenging task It shares most of the difficulties with stream querying Patterns are hidden and more general than querying It may require exploratory analysis

Not necessarily continuous queries

Stream data mining tasks Multi-dimensional on-line analysis of streams Mining outliers and unusual patterns in stream data Clustering data streams Classification of stream data


Stream Data Mining Tasks

Multi-dimensional (on-line) analysis of streams Clustering data streams Classification of data streams Mining frequent patterns in data streams Mining sequential patterns in data streams Mining partial periodicity in data streams Mining notable gradients in data streams Mining outliers and unusual patterns in data streams ……, more?


Challenges for Mining Dynamics in Data Streams

Most stream data are at pretty low-level or multi-

dimensional in nature: needs ML/MD processing

Analysis requirements

Multi-dimensional trends and unusual patterns

Capturing important changes at multi-dimensions/levels

Fast, real-time detection and response

Comparing with data cube: Similarity and differences

Stream (data) cube or stream OLAP: Is this feasible?

Can we implement it efficiently?


Multi-Dimensional Stream Analysis: Examples

Analysis of Web click streams Raw data at low levels: seconds, web page addresses, user IP

addresses, … Analysts want: changes, trends, unusual patterns, at reasonable

levels of details E.g., Average clicking traffic in North America on sports in the last

15 minutes is 40% higher than that in the last 24 hours.”

Analysis of power consumption streams Raw data: power consumption flow for every household, every

minute Patterns one may find: average hourly power consumption

surges up 30% for manufacturing companies in Chicago in the last 2 hours today than that of the same day a week ago


A Key Step—Stream Data Reduction

Challenges of OLAPing stream data Raw data cannot be stored

Simple aggregates are not powerful enough

History shape and patterns at different levels are desirable:

multi-dimensional regression analysis

Proposal A scalable multi-dimensional stream “data cube” that can

aggregate regression model of stream data efficiently without

accessing the raw data

Stream data compression Compress the stream data to support memory- and time-

efficient multi-dimensional regression analysis


Regression Cube for Time-Series

Initially, one time-series per base cell Too costly to store all these time-series Too costly to compute regression at multi-dimensional space

Regression cube Base cube: only store regression parameters of base cells (e.g.,

2 points vs. 1000 points) All the upper level cuboids can be computed precisely for linear

regression on both standard dimensions and time dimensions For quadratic regression, we need 5 points In general, we need:

where k = 2 for quadratic.

,2

3

2

)1( 2 kkkkk


Basics of General Linear Regression

n tuples in one cell: (xi , yi), i =1..n, where yi is the measure attribute to be analyzed

For sample i , a vector of k user-defined predictors ui:

The linear regression model:

where η is a k × 1 vector of regression parameters

1,

1

1

1

0

...

1

...

ki

i

ik

ii

u

u

u

u

u

x

xu

1,1110 ...| kikiiT

ii uuyE uu


Stock Price Example—Aggregation in Standard Dimensions

Simple linear regression on time series dataCells of two companies

After aggregation:


Stock Price Example—Aggregation in Time Dimension

Cells of two adjacent time intervals:

After aggregation


Feasibility of Stream Regression Analysis

Efficient storage and scalable (independent of the

number of tuples in data cells)

Lossless aggregation without accessing the raw data

Fast aggregation: computationally efficient

Regression models of data cells at all levels

General results: covered a large and the most popular

class of regression

Including quadratic, polynomial, and nonlinear models


A Stream Cube Architecture

A tilted time frame Different time granularities

second, minute, quarter, hour, day, week, …

Critical layers Minimum interest layer (m-layer) Observation layer (o-layer) User: watches at o-layer and occasionally needs to drill-down

down to m-layer

Partial materialization of stream cubes Full materialization: too space and time consuming No materialization: slow response at query time Partial materialization: what do we mean “partial”?


A Tilted Time-Frame Model

31 days 24 hours 4 qtrs12 months

Time Now

24hrs 4qtrs 15minutes7 days

Time Now

25sec.Up to 7 days:

Up to a year:

Logarithmic (exponential) scale:2t 1t

Time Now

4t8t16t


Benefits of Tilted Time-Frame Model

Each cell stores the measures according to tilt-time-frame

Limited memory space: Impossible to store the history in full scale

Emphasis more on recent data

Most applications emphasize on recent data (slide window)

Natural partition on different time granularities

Putting different weights on remote data

Useful even for uniform weight

Tilted time-frame forms a new time dimension

for mining changes and evolutions

Essential for mining dynamic patterns or outliers

Finding those with dramatic changes

E.g., exceptional stocks—not following the trends


Two Critical Layers in the Stream Cube

(*, theme, quarter)

(user-group, URL-group, minute)

m-layer (minimal interest)

(individual-user, URL, second)

(primitive) stream data layer

o-layer (observation)


On-Line Materialization vs. On-Line Computation

On-line materialization

Materialization takes precious resources and time

Only incremental materialization (with slide window)

Only materialize “cuboids” of the critical layers?

Some intermediate cells that should be materialized

Popular path approach vs. exception cell approach

Materialize intermediate cells along the popular paths

Exception cells: how to set up exception thresholds?

Notice exceptions do not have monotonic behavior

Computation problem

How to compute and store stream cubes efficiently?

How to discover unusual cells between the critical layer?


Stream Cube Structure: from m-layer to o-layer

(A1, *, C1)

(A1, *, C2) (A1, *, C2)

(A1, *, C2)

(A1, B1, C2) (A1, B2, C1)

(A2, *, C2)

(A2, B1, C1)

(A1, B2, C2)

(A2, B1, C2)

A2, B2, C1)

(A2, B2, C2)


Stream Cube Computation

Cube structure from m-layer to o-layer

Three approaches All cuboids approach

Materializing all cells (too much in both space and time)

Exceptional cells approach

Materializing only exceptional cells (saves space but not time

to compute and definition of exception is not flexible)

Popular path approach

Computing and materializing cells only along a popular path

Using H-tree structure to store computed cells (which form the

stream cube—a selectively materialized cube)


An H-Tree Cubing Structure

root

entertainmentsports politics

uic uiuc uic uiuc

jeff

mary jeff Jim

Q.I.Q.I. Q.I.

Regression:

Sum: xxxxCnt: yyyy

Quant-Info

Observation layer

Minimal int. layer


Benefits of H-Tree and H-Cubing

H-tree and H-cubing Developed for computing data cubes and ice-berg cubes

J. Han, J. Pei, G. Dong, and K. Wang, “Efficient Computation of Iceberg Cubes with Complex Measures”, SIGMOD'01

Compressed database Fast cubing Space preserving in cube computation

Using H-tree for stream cubing Space preserving

Intermediate aggregates can be computed incrementally and saved in tree nodes

Facilitate computing other cells and multi-dimensional analysis H-tree with computed cells can be viewed as stream cube


1

10

100

1000

25 50 100 200 400

Size (in K tuples)

Run

tim

e (i

n se

cond

s)

popular-pathexception-cellsall-cubing

0

200

400

600

25 50 100 200 400

Size (in K tuples)

Mem

ory

usag

e (i

n M

B)

popular-path

exception-cells

all-cubing

a) Time vs. m-layer size b) Space vs. m-layer size

Time and Space vs. Number of Tuples at the m-Layer (Dataset D3L3C10T400K)


1

10

100

1000

10000

3 4 5 6 7

Number of levels

Run

tim

e (i

n se

cond

s)


1

10

100

1000

10000

3 4 5 6 7

Number of levels

Mem

ory

usag

e (i

n M

B)


Time and Space vs. the Number of Levels

a) Time vs. # levels b) Space vs. # levels


Mining Other Dynamic Patterns in Stream Data

Clustering and outlier analysis for stream mining

Clustering data streams (Guha, Motwani et al. 2000-2002)

History-sensitive, high-quality incremental clustering

Classification of stream data

Evolution of decision trees: Domingos et al. (2000, 2001)

Incremental integration of new streams in decision-tree induction

Frequent pattern analysis

Approximate frequent patterns (Manku & Motwani VLDB’02)

Evolution and dramatic changes of frequent patterns


Clustering for Mining Stream Dynamics

Network intrusion detection: one example Detect bursts of activities or abrupt changes in real time—by on-

line clustering

Our methodology Tilted time frame work: o.w. dynamic changes cannot be found

Micro-clustering: better quality than k-means/k-median

incremental, online processing and maintenance)

Two stages: micro-clustering and macro-clustering

With limited “overhead” to achieve high efficiency, scalability,

quality of results and power of evolution/change detection


Classification for Dynamic Data Streams

Decision tree induction for stream data classification VFDT (Very Fast Decision Tree)/CVFDT (Domingos, Hulten,

Spencer, KDD00/KDD01)

Is decision-tree good for modeling fast changing data, e.g., stock

market analysis?

Methodology Tilted time framework

Instead of decision-trees, should we consider other models which

do not changes drastically, e.g., Naïve Bayesian with boosting?

Incremental updating, dynamic maintenance, and model

construction

Comparing of models to find changes


Frequent Patterns for Stream Data

Frequent pattern mining is valuable in stream applications e.g., network intrusion mining (Dokas, et al’02)

Mining precise freq patterns in stream data: unrealistic Even store them in a compressed form, such as FPtree

How to mine frequent patterns with good approximation? Approximate frequent patterns (Manku & Motwani VLDB’02)

Keep only current frequent patterns? No changes can be detected

Our approach

Keep Pattern-trees at the tilted time window frame (using tree-

sharing method)

Mining evolution and dramatic changes of frequent patterns


Conclusions

Stream data mining: A rich and largely unexplored field Current research focus in database community:

DSMS system architecture, continuous query processing, supporting mechanisms

Stream data mining and stream OLAP analysis Powerful tools for finding general and unusual patterns Effectiveness, efficiency and scalability: lots of open problems

Our philosophy: A multi-dimensional stream analysis framework Time is a special dimension: tilted time frame What to compute and what to save?—Critical layers Very partial materialization/precomputation: popular path approach Mining dynamics of stream data


References

B. Babcock, S. Babu, M. Datar, R. Motawani, and J. Widom, “Models and issues in data

stream systems”, PODS'02 (tutorial).

S. Babu and J. Widom, “Continuous queries over data streams”, SIGMOD Record,

30:109--120, 2001.

Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and J. Wang. “Online analytical processing

stream data: Is it feasible?”, DMKD'02.

Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, “Multi-dimensional regression

analysis of time-series data streams”, VLDB'02.

P. Domingos and G. Hulten, “Mining high-speed data streams”, KDD'00.

M. Garofalakis, J. Gehrke, and R. Rastogi, “Querying and mining data streams: You only

get one look”, SIGMOD'02 (tutorial).

J. Gehrke, F. Korn, and D. Srivastava, “On computing correlated aggregates over

continuous data streams”, SIGMOD'01.

S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams”,

FOCS'00.

G. Hulten, L. Spencer, and P. Domingos, “Mining time-changing data streams”, KDD'01.


www.cs.uiuc.edu/~hanj

Thank you !!!Thank you !!!

Mining Dynamics of Data Streams in Multidimensional Space ...

Documents

Transcript of Mining Dynamics of Data Streams in Multidimensional Space ...