Mining Dynamics of Data Streams in Multidimensional Space
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
04/13/23 Mining Dynamics in Data Streams 2
Outline
Characteristics of data streams
Mining dynamics in data streams
Multi-dimensional (regression) analysis of data streams
Stream cubing and stream OLAP methods
Mining other kinds of patterns in data streams
Research problems
Characteristics of Data Streams
Data Streams
  Data streams: continuous, ordered, changing, fast, huge amount
  Traditional DBMS: data stored in finite, persistent data sets
Characteristics
  Huge volumes of continuous data, possibly infinite
  Fast changing, requiring fast, real-time response
  Data streams capture today's data processing needs well
  Random access is expensive: single linear scan algorithms (can only have one look)
  Store only a summary of the data seen thus far
  Most stream data are at a fairly low level or multi-dimensional in nature, and need multi-level and multi-dimensional processing
Stream Data Applications
Telecommunication calling records
Business: credit card transaction flows
Network monitoring and traffic engineering
Financial market: stock exchange
Engineering & industrial processes: power supply & manufacturing
Sensor, monitoring & surveillance: video streams
Security monitoring
Web logs and Web page click streams
Massive data sets (even when saved, random access is too expensive)
Architecture: Stream Query Processing
[Figure: stream query processing architecture. Components: Multiple streams, Stream Query Processor, Scratch Space (main memory and/or disk), User/Application, Continuous Query, Results.]
SDMS (Stream Data Management System)
Challenges of Stream Data Processing
Multiple, continuous, rapid, time-varying, ordered streams
Main memory computations
Queries are often continuous
  Evaluated continuously as stream data arrives
  Answer updated over time
Queries are often complex
  Beyond element-at-a-time processing
  Beyond stream-at-a-time processing
  Beyond relational queries (scientific, data mining, OLAP)
Multi-level/multi-dimensional processing and data mining
  Most stream data are at a fairly low level or multi-dimensional in nature
Processing Stream Queries
Query types
  One-time query vs. continuous query (evaluated continuously as the stream continues to arrive)
  Predefined query vs. ad-hoc query (issued on-line)
Unbounded memory requirements
  For real-time response, main memory algorithms should be used
  Memory requirement is unbounded if one will join future tuples
Approximate query answering
  With bounded memory, it is not always possible to produce exact answers
  High-quality approximate answers are desired
  Data reduction and synopsis construction methods: sketches, random sampling, histograms, wavelets, etc.
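Of the synopsis methods listed on this slide, random sampling is the simplest to sketch. The following is a minimal illustration of reservoir sampling (Algorithm R), a standard one-pass technique that the slides do not spell out; the function name `reservoir_sample` and its parameters are illustrative choices, not taken from the talk.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Uses O(k) memory and looks at each item exactly once.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 5 values from a stream of 10,000 without storing the stream.
sample = reservoir_sample(range(10_000), k=5, rng=random.Random(42))
```

The sample is a bounded-memory summary of the data seen so far, which is the property the slide asks of a synopsis.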
Projects on DSMS (Data Stream Management System)
Research projects and system prototypes
  STREAM (Stanford): a general-purpose DSMS
  Cougar (Cornell): sensors
  Aurora (Brown/MIT): sensor monitoring, dataflow
  Hancock (AT&T): telecom streams
  Niagara (OGI/Wisconsin): Internet XML databases
  OpenCQ (Georgia Tech): triggers, incremental view maintenance
  Tapestry (Xerox): pub/sub content-based filtering
  Telegraph (Berkeley): adaptive engine for sensors
  Tradebot (www.tradebot.com): stock tickers & streams
  Tribeca (Bellcore): network monitoring
  Streaminer (UIUC): new project for stream data mining
Stream Data Mining vs. Stream Querying
Stream mining: a more challenging task
  It shares most of the difficulties with stream querying
  Patterns are hidden and more general than querying
  It may require exploratory analysis
    Not necessarily continuous queries
Stream data mining tasks
  Multi-dimensional on-line analysis of streams
  Mining outliers and unusual patterns in stream data
  Clustering data streams
  Classification of stream data
Stream Data Mining Tasks
Multi-dimensional (on-line) analysis of streams
Clustering data streams
Classification of data streams
Mining frequent patterns in data streams
Mining sequential patterns in data streams
Mining partial periodicity in data streams
Mining notable gradients in data streams
Mining outliers and unusual patterns in data streams
..., more?
Challenges for Mining Dynamics in Data Streams
Most stream data are at a fairly low level or multi-dimensional in nature: they need multi-level/multi-dimensional (ML/MD) processing
Analysis requirements
Multi-dimensional trends and unusual patterns
Capturing important changes at multi-dimensions/levels
Fast, real-time detection and response
Comparison with the data cube: similarities and differences
Stream (data) cube or stream OLAP: Is this feasible?
Can we implement it efficiently?
Multi-Dimensional Stream Analysis: Examples
Analysis of Web click streams
  Raw data at low levels: seconds, web page addresses, user IP addresses, …
  Analysts want: changes, trends, unusual patterns, at reasonable levels of detail
  E.g., “Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours.”
Analysis of power consumption streams
  Raw data: power consumption flow for every household, every minute
  Patterns one may find: average hourly power consumption of manufacturing companies in Chicago over the last 2 hours today is 30% higher than on the same day a week ago
A Key Step—Stream Data Reduction
Challenges of OLAPing stream data
  Raw data cannot be stored
  Simple aggregates are not powerful enough
  History shape and patterns at different levels are desirable: multi-dimensional regression analysis
Proposal
  A scalable multi-dimensional stream “data cube” that can aggregate regression models of stream data efficiently without accessing the raw data
Stream data compression
  Compress the stream data to support memory- and time-efficient multi-dimensional regression analysis
Regression Cube for Time-Series
Initially, one time-series per base cell
  Too costly to store all these time-series
  Too costly to compute regression in multi-dimensional space
Regression cube
  Base cube: only store regression parameters of base cells (e.g., 2 points vs. 1000 points)
  All the upper-level cuboids can be computed precisely for linear regression on both standard dimensions and time dimensions
  For quadratic regression, we need 5 points
  In general, we need $\frac{k(k+1)}{2} + k = \frac{k^2 + 3k}{2}$ points, where $k = 2$ for quadratic.
Basics of General Linear Regression
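n tuples in one cell: $(\mathbf{x}_i, y_i)$, $i = 1..n$, where $y_i$ is the measure attribute to be analyzed
For sample $i$, a vector of $k$ user-defined predictors $\mathbf{u}_i$:
$$\mathbf{u}_i = \begin{pmatrix} u_0 \\ u_1(\mathbf{x}_i) \\ \vdots \\ u_{k-1}(\mathbf{x}_i) \end{pmatrix} = \begin{pmatrix} 1 \\ u_{i1} \\ \vdots \\ u_{i,k-1} \end{pmatrix}$$
The linear regression model:
$$E(y_i \mid \mathbf{u}_i) = \eta_0 + \eta_1 u_{i1} + \cdots + \eta_{k-1} u_{i,k-1} = \boldsymbol{\eta}^T \mathbf{u}_i$$
where $\boldsymbol{\eta} = (\eta_0, \eta_1, \ldots, \eta_{k-1})^T$ is a k × 1 vector of regression parameters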
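As a worked step for the model on this slide, the ordinary least-squares estimate of the parameter vector η from the n predictor vectors u_i and measures y_i is given by the normal equations. This is the textbook result, not a formula recovered from the slides, and it assumes the k × k matrix being inverted is nonsingular:

```latex
\hat{\boldsymbol{\eta}}
  = \Bigl( \sum_{i=1}^{n} \mathbf{u}_i \mathbf{u}_i^{T} \Bigr)^{-1}
    \sum_{i=1}^{n} \mathbf{u}_i \, y_i
```

One plausible reading of the storage count $(k^2 + 3k)/2$ quoted earlier, offered here as an assumption rather than something the slides state: the matrix $\sum_i \mathbf{u}_i \mathbf{u}_i^T$ is symmetric and so has $k(k+1)/2$ distinct entries, and the estimate $\hat{\boldsymbol{\eta}}$ adds $k$ more, giving $k(k+1)/2 + k = (k^2 + 3k)/2$ numbers per cell, which is 5 when $k = 2$.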
Stock Price Example—Aggregation in Standard Dimensions
Simple linear regression on time series data
Cells of two companies:
After aggregation:
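The cell values from the slide's figure are not available in this transcript, so the numbers below are invented. As a minimal sketch, assuming that aggregating over a standard dimension means summing two series observed at the same time points, least-squares lines add parameter by parameter, so the aggregated cell can be derived from the two stored lines alone. The helper `fit_line` and the variable names are illustrative.

```python
def fit_line(ts, ys):
    """Ordinary least-squares line y = base + slope * t; returns (base, slope)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_t, slope


# Hypothetical prices for two companies over the same five time ticks.
ts = [0, 1, 2, 3, 4]
company_a = [10.0, 12.0, 13.0, 15.0, 18.0]
company_b = [5.0, 4.0, 6.0, 5.0, 7.0]

base_a, slope_a = fit_line(ts, company_a)
base_b, slope_b = fit_line(ts, company_b)

# Aggregate using only the stored regression parameters.
agg_base, agg_slope = base_a + base_b, slope_a + slope_b

# Cross-check against a fit on the summed raw series.
total = [a + b for a, b in zip(company_a, company_b)]
raw_base, raw_slope = fit_line(ts, total)
```

The two results agree because the least-squares fit is linear in the measure values, which is what makes the aggregation lossless in this setting.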
Stock Price Example—Aggregation in Time Dimension
Cells of two adjacent time intervals:
After aggregation
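The slide's own formulas for merging adjacent time intervals are not available in this transcript. The sketch below substitutes a different, simpler summary: five running sums per cell (n, Σt, Σy, Σt², Σty) instead of the compact two-parameter representation the talk refers to. It only illustrates that the line over the merged interval can be obtained without revisiting raw data. The class `LineStats` and the prices are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class LineStats:
    """Running sums that are sufficient to fit y = base + slope * t."""
    n: int = 0
    st: float = 0.0   # sum of t
    sy: float = 0.0   # sum of y
    stt: float = 0.0  # sum of t * t
    sty: float = 0.0  # sum of t * y

    def add(self, t, y):
        self.n += 1
        self.st += t
        self.sy += y
        self.stt += t * t
        self.sty += t * y

    def merge(self, other):
        """Summary of the union of two disjoint sets of observations."""
        return LineStats(self.n + other.n, self.st + other.st,
                         self.sy + other.sy, self.stt + other.stt,
                         self.sty + other.sty)

    def fit(self):
        denom = self.n * self.stt - self.st ** 2
        slope = (self.n * self.sty - self.st * self.sy) / denom
        base = (self.sy - slope * self.st) / self.n
        return base, slope


# Hypothetical prices: ticks 0-3 form the first interval, 4-7 the second.
prices = [10.0, 12.0, 13.0, 15.0, 18.0, 17.0, 19.0, 22.0]
first, second, whole = LineStats(), LineStats(), LineStats()
for t, y in enumerate(prices):
    (first if t < 4 else second).add(t, y)
    whole.add(t, y)

merged = first.merge(second)  # built from the two summaries only
```

Here `merged.fit()` matches `whole.fit()`, the fit over all eight raw points.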
Feasibility of Stream Regression Analysis
Efficient storage and scalable (independent of the number of tuples in data cells)
Lossless aggregation without accessing the raw data
Fast aggregation: computationally efficient
Regression models of data cells at all levels
General results: cover a large and popular class of regression models
Including quadratic, polynomial, and nonlinear models
A Stream Cube Architecture
A tilted time frame
  Different time granularities: second, minute, quarter, hour, day, week, …
Critical layers
  Minimum interest layer (m-layer)
  Observation layer (o-layer)
  User: watches at the o-layer and occasionally needs to drill down to the m-layer
Partial materialization of stream cubes
  Full materialization: too space and time consuming
  No materialization: slow response at query time
  Partial materialization: what do we mean by “partial”?
A Tilted Time-Frame Model
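[Figure: three tilted time-frame models along a time axis ending at "Now"; units listed from most recent to oldest]
Up to a year: 4 qtrs, 24 hours, 31 days, 12 months
Up to 7 days: 25 sec., 15 minutes, 4 qtrs, 24 hrs, 7 days
Logarithmic (exponential) scale: 1t, 2t, 4t, 8t, 16t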
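A short arithmetic sketch of why the tilted frame saves space, using the "up to a year" frame from the figure. It assumes "qtrs" means quarter-hours (15 minutes) and compares against keeping every quarter-hour of a 366-day year; the variable names are illustrative.

```python
# Slots kept per cell by the "up to a year" tilted time frame.
tilted_slots = {
    "quarter-hours": 4,   # the most recent hour, in 15-minute units
    "hours": 24,          # the most recent day
    "days": 31,           # the most recent month
    "months": 12,         # the most recent year
}
tilted_total = sum(tilted_slots.values())

# Slots needed to keep the same year at uniform quarter-hour resolution.
uniform_total = 366 * 24 * 4

savings_ratio = uniform_total / tilted_total
```

Under these assumptions each cell keeps 71 time slots instead of 35,136, a reduction of roughly 495 times.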
Benefits of Tilted Time-Frame Model
Each cell stores the measures according to the tilted time frame
Limited memory space: Impossible to store the history in full scale
More emphasis on recent data
Most applications emphasize recent data (sliding window)
Natural partition on different time granularities
Putting different weights on remote data
Useful even for uniform weight
Tilted time-frame forms a new time dimension for mining changes and evolutions
Essential for mining dynamic patterns or outliers
Finding those with dramatic changes
E.g., exceptional stocks—not following the trends
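The logarithmic tilted time frame can be sketched as a small multi-level buffer. This is a minimal illustration under stated assumptions, not the talk's implementation: the measure is a simple additive sum, level i holds at most `cap` aggregates that each cover 2**i base time units, and the two oldest aggregates of an overflowing level are merged into the next level. The class name `LogTiltedWindow` and its methods are hypothetical.

```python
class LogTiltedWindow:
    """Keeps O(log n) aggregates for n time units, finest for recent data."""

    def __init__(self, cap=2):
        self.cap = cap
        self.levels = [[]]  # levels[i] is ordered oldest -> newest

    def add(self, value):
        """Record the aggregate for one new base time unit."""
        self.levels[0].append(value)
        i = 0
        while len(self.levels[i]) > self.cap:
            # Merge the two oldest aggregates into one coarser aggregate.
            oldest = self.levels[i].pop(0)
            second = self.levels[i].pop(0)
            if i + 1 == len(self.levels):
                self.levels.append([])
            self.levels[i + 1].append(oldest + second)
            i += 1

    def slots(self):
        return sum(len(level) for level in self.levels)

    def total(self):
        return sum(sum(level) for level in self.levels)


# Usage: 1000 time units, each contributing a count of 1.
window = LogTiltedWindow()
for _ in range(1000):
    window.add(1)
```

The full total is preserved while the number of stored slots grows only logarithmically with the length of the history, which is the memory argument made on this slide.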
Two Critical Layers in the Stream Cube
o-layer (observation): (*, theme, quarter)
m-layer (minimal interest): (user-group, URL-group, minute)
(primitive) stream data layer: (individual-user, URL, second)
On-Line Materialization vs. On-Line Computation
On-line materialization
Materialization takes precious resources and time
Only incremental materialization (with sliding window)
Only materialize “cuboids” of the critical layers?
Some intermediate cells that should be materialized
Popular path approach vs. exception cell approach
Materialize intermediate cells along the popular paths
Exception cells: how to set up exception thresholds?
Notice exceptions do not have monotonic behavior
Computation problem
How to compute and store stream cubes efficiently?
How to discover unusual cells between the critical layers?
Stream Cube Structure: from m-layer to o-layer
[Figure: lattice of cuboids from the o-layer (top) down to the m-layer (bottom)]
(A1, *, C1)
(A1, *, C2) (A1, B1, C1) (A2, *, C1)
(A1, B1, C2) (A1, B2, C1) (A2, *, C2) (A2, B1, C1)
(A1, B2, C2) (A2, B1, C2) (A2, B2, C1)
(A2, B2, C2)
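The cuboids between the two critical layers can be enumerated mechanically. The sketch below assumes, from the cell labels on this slide, that dimension A has levels A1 and A2, B has levels *, B1 and B2, and C has levels C1 and C2, each ordered from most general to most specific; the names `levels`, `depth` and `by_depth` are illustrative.

```python
from itertools import product

# Levels per dimension, most general first (assumed from the slide's labels).
levels = {
    "A": ["A1", "A2"],
    "B": ["*", "B1", "B2"],
    "C": ["C1", "C2"],
}

# Every combination of one level per dimension is a cuboid in the lattice.
cuboids = list(product(levels["A"], levels["B"], levels["C"]))


def depth(cuboid):
    """Number of drill-down steps from the o-layer cuboid (A1, *, C1)."""
    return sum(levels[dim].index(value) for dim, value in zip("ABC", cuboid))


# Group cuboids by depth: depth 0 is the o-layer, the deepest is the m-layer.
by_depth = {}
for cuboid in cuboids:
    by_depth.setdefault(depth(cuboid), []).append(cuboid)
```

With these assumptions there are 12 cuboids in rows of 1, 3, 4, 3 and 1, and a popular path is one chain that picks a single cuboid per row from (A2, B2, C2) up to (A1, *, C1).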
Stream Cube Computation
Cube structure from m-layer to o-layer
Three approaches
  All cuboids approach
    Materializing all cells (too much in both space and time)
  Exceptional cells approach
    Materializing only exceptional cells (saves space but not computation time, and the definition of exception is not flexible)
  Popular path approach
    Computing and materializing cells only along a popular path
    Using an H-tree structure to store computed cells (which form the stream cube: a selectively materialized cube)
An H-Tree Cubing Structure
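[Figure: an H-tree. Below the root are levels of nodes labeled entertainment, sports, politics; then uic, uiuc; then jeff, mary, Jim. Nodes carry quant-info (Q.I.) consisting of Sum: xxxx, Cnt: yyyy, and Regression. The observation layer and the minimal interest layer are marked on the tree.]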
Benefits of H-Tree and H-Cubing
H-tree and H-cubing
  Developed for computing data cubes and iceberg cubes
    J. Han, J. Pei, G. Dong, and K. Wang, “Efficient Computation of Iceberg Cubes with Complex Measures”, SIGMOD'01
  Compressed database
  Fast cubing
  Space preserving in cube computation
Using H-tree for stream cubing
  Space preserving
    Intermediate aggregates can be computed incrementally and saved in tree nodes
  Facilitates computing other cells and multi-dimensional analysis
  H-tree with computed cells can be viewed as a stream cube
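The idea of saving intermediate aggregates in tree nodes can be sketched with a plain prefix tree. This is a simplification, not the H-tree of the cited SIGMOD'01 paper: it omits header tables and side links, and it stores only a count and a sum per node. The attribute order (theme, university, user) follows the H-tree figure; the class names and the measure values are invented.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.count = 0     # number of tuples under this node
        self.total = 0.0   # sum of the measure under this node


class PrefixCube:
    """Prefix tree whose nodes hold aggregates for every prefix of the path."""

    def __init__(self):
        self.root = Node()

    def insert(self, path, measure):
        """Add one tuple, updating aggregates incrementally along its path."""
        node = self.root
        node.count += 1
        node.total += measure
        for key in path:
            node = node.children.setdefault(key, Node())
            node.count += 1
            node.total += measure

    def lookup(self, prefix):
        """Return (count, sum) for a prefix, or (0, 0.0) if it was never seen."""
        node = self.root
        for key in prefix:
            node = node.children.get(key)
            if node is None:
                return 0, 0.0
        return node.count, node.total


cube = PrefixCube()
cube.insert(("sports", "uiuc", "jeff"), 3.0)
cube.insert(("sports", "uiuc", "mary"), 2.0)
cube.insert(("sports", "uic", "jeff"), 1.0)
cube.insert(("politics", "uiuc", "jim"), 4.0)
```

A call such as `cube.lookup(("sports", "uiuc"))` answers a roll-up query from a stored node without rescanning tuples, which is the benefit this slide attributes to keeping aggregates in the tree.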
[Figure: Time and Space vs. Number of Tuples at the m-Layer (Dataset D3L3C10T400K). a) Time vs. m-layer size: runtime (in seconds) against size (in K tuples). b) Space vs. m-layer size: memory usage (in MB) against size (in K tuples). Series: popular-path, exception-cells, all-cubing.]
[Figure: Time and Space vs. the Number of Levels. a) Time vs. # levels: runtime (in seconds) against number of levels. b) Space vs. # levels: memory usage (in MB) against number of levels. Series: popular-path, exception-cells, all-cubing.]
Mining Other Dynamic Patterns in Stream Data
Clustering and outlier analysis for stream mining
Clustering data streams (Guha, Motwani et al. 2000-2002)
History-sensitive, high-quality incremental clustering
Classification of stream data
Evolution of decision trees: Domingos et al. (2000, 2001)
Incremental integration of new streams in decision-tree induction
Frequent pattern analysis
Approximate frequent patterns (Manku & Motwani VLDB’02)
Evolution and dramatic changes of frequent patterns
Clustering for Mining Stream Dynamics
Network intrusion detection: one example
  Detect bursts of activities or abrupt changes in real time by on-line clustering
Our methodology
  Tilted time framework: otherwise dynamic changes cannot be found
  Micro-clustering: better quality than k-means/k-median (incremental, online processing and maintenance)
  Two stages: micro-clustering and macro-clustering
  With limited “overhead” to achieve high efficiency, scalability, quality of results and power of evolution/change detection
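The slide does not say what a micro-cluster stores. One common choice, assumed here, is an additive summary of count, per-dimension linear sum, and per-dimension squared sum (the cluster-feature idea known from BIRCH), which supports incremental updates and merging. The class `MicroCluster` and its methods are a hypothetical sketch, not the authors' design.

```python
import math


class MicroCluster:
    """Additive summary of a group of points: count, linear sum, squared sum."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim  # per-dimension sum of coordinates
        self.ss = [0.0] * dim  # per-dimension sum of squared coordinates

    def absorb(self, point):
        """Incrementally add one point without storing it."""
        self.n += 1
        for d, x in enumerate(point):
            self.ls[d] += x
            self.ss[d] += x * x

    def merge(self, other):
        """Fold another micro-cluster into this one (e.g. for macro-clustering)."""
        self.n += other.n
        for d in range(len(self.ls)):
            self.ls[d] += other.ls[d]
            self.ss[d] += other.ss[d]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        """Root-mean-square distance of the absorbed points from the centroid."""
        var = sum(self.ss[d] / self.n - (self.ls[d] / self.n) ** 2
                  for d in range(len(self.ls)))
        return math.sqrt(max(var, 0.0))


a, b = MicroCluster(2), MicroCluster(2)
for p in [(0.0, 0.0), (2.0, 0.0)]:
    a.absorb(p)
for p in [(0.0, 2.0), (2.0, 2.0)]:
    b.absorb(p)
a.merge(b)
```

Because the summary is additive, one micro-cluster snapshot could be kept per slot of a tilted time frame and differenced or merged later; that pairing is a suggestion consistent with the slide, not something it specifies.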
Classification for Dynamic Data Streams
Decision tree induction for stream data classification
  VFDT (Very Fast Decision Tree)/CVFDT (Domingos, Hulten, Spencer, KDD00/KDD01)
  Is a decision tree good for modeling fast changing data, e.g., stock market analysis?
Methodology
  Tilted time framework
  Instead of decision trees, should we consider other models which do not change drastically, e.g., Naïve Bayesian with boosting?
  Incremental updating, dynamic maintenance, and model construction
  Comparing models to find changes
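VFDT, cited on this slide, decides when a node has seen enough examples to split by using the Hoeffding bound. A minimal sketch of that test follows; the function names and the example gains are illustrative, and the rest of the tree-building procedure is omitted.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean of n
    independent observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split(best_gain, second_gain, value_range, delta, n):
    """Split once the observed lead of the best attribute exceeds epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Hypothetical gains observed at a leaf after 100 and after 1000 examples.
early = can_split(0.30, 0.10, value_range=1.0, delta=1e-7, n=100)
later = can_split(0.30, 0.10, value_range=1.0, delta=1e-7, n=1000)
```

With these made-up numbers the same gain difference is not yet decisive after 100 examples but is after 1000, because the bound shrinks as n grows.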
Frequent Patterns for Stream Data
Frequent pattern mining is valuable in stream applications
  e.g., network intrusion mining (Dokas, et al’02)
Mining precise frequent patterns in stream data: unrealistic
  Even if they are stored in a compressed form, such as an FP-tree
How to mine frequent patterns with good approximation?
  Approximate frequent patterns (Manku & Motwani VLDB’02)
  Keep only current frequent patterns? No changes can be detected
Our approach
  Keep pattern-trees at the tilted time window frame (using tree-sharing method)
  Mining evolution and dramatic changes of frequent patterns
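The approximate approach cited on this slide (Manku & Motwani, VLDB'02) includes the lossy counting algorithm. The sketch below covers the single-item version only, as a rough illustration of trading exactness for bounded memory; extending it to itemsets and to pattern-trees per tilted time window is beyond this sketch. The class name `LossyCounter` and the test stream are invented.

```python
import math


class LossyCounter:
    """Lossy counting for single items with error parameter epsilon.

    Reported counts underestimate true counts by at most epsilon * n.
    """

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # bucket width
        self.n = 0                           # items seen so far
        self.entries = {}                    # item -> [count, max_error]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # The item may have been pruned earlier, at most bucket - 1 times.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries that cannot be frequent.
            self.entries = {k: v for k, v in self.entries.items()
                            if v[0] + v[1] > bucket}

    def frequent(self, support):
        """Items whose count reaches (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {k: v[0] for k, v in self.entries.items() if v[0] >= threshold}


# Synthetic stream: "a" is half the items, "b" is 30%, the rest are unique.
counter = LossyCounter(epsilon=0.1)
for i in range(1000):
    if i % 2 == 0:
        counter.add("a")
    elif i % 10 in (1, 3, 5):
        counter.add("b")
    else:
        counter.add(f"x{i}")

result = counter.frequent(support=0.2)
```

The counter retains only a handful of entries while still reporting both frequent items, whereas the 200 one-off items are pruned at bucket boundaries.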
Conclusions
Stream data mining: a rich and largely unexplored field
  Current research focus in database community: DSMS system architecture, continuous query processing, supporting mechanisms
  Stream data mining and stream OLAP analysis
    Powerful tools for finding general and unusual patterns
    Effectiveness, efficiency and scalability: lots of open problems
Our philosophy: a multi-dimensional stream analysis framework
  Time is a special dimension: tilted time frame
  What to compute and what to save? Critical layers
  Very partial materialization/precomputation: popular path approach
  Mining dynamics of stream data
References
B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and issues in data stream systems”, PODS'02 (tutorial).
S. Babu and J. Widom, “Continuous queries over data streams”, SIGMOD Record, 30:109-120, 2001.
Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and J. Wang, “Online analytical processing stream data: Is it feasible?”, DMKD'02.
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, “Multi-dimensional regression analysis of time-series data streams”, VLDB'02.
P. Domingos and G. Hulten, “Mining high-speed data streams”, KDD'00.
M. Garofalakis, J. Gehrke, and R. Rastogi, “Querying and mining data streams: You only get one look”, SIGMOD'02 (tutorial).
J. Gehrke, F. Korn, and D. Srivastava, “On computing correlated aggregates over continuous data streams”, SIGMOD'01.
S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams”, FOCS'00.
G. Hulten, L. Spencer, and P. Domingos, “Mining time-changing data streams”, KDD'01.
www.cs.uiuc.edu/~hanj
Thank you!!!