A Framework for Clustering Evolving Data Streams
Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu
Presented by: Di Yang
Charudatta Wad
Outline
Background of Clustering
Motivation for Clustering over Streaming Data
Overall Solution
Micro-Clusters
Pyramidal Time Frame
Macro-Cluster
Cluster Maintenance
Background of Clustering
Definition of Clustering: For a given set of data points, partition them into one or more groups of similar objects.
“Similarity” is usually defined by a distance measure.
Difference between “group by” queries and clustering.
Background of Clustering
Some of the most popular clustering algorithms: K-Means, BIRCH, CURE, Density-Based Clustering.
Clustering has many applications in databases, information visualization, and data mining.
What are Outliers?
Motivation
Challenges in a Streaming Environment: Clustering is an expensive process. Resource constraints. Infinite streams.
Is it enough to simply extend one-pass algorithms for static databases to stream processing?
Motivation
Requirements of clustering for stream processing: Storage of statistical summary information. An efficient update process. The ability to cluster over a specific time horizon.
Overall Solution of the Paper
Divide the clustering process into two phases:
Online Component: periodically stores detailed summary statistics.
Offline Component: uses only the summary statistics to do clustering.
Micro-Clusters
What is a Micro-Cluster? A micro-cluster is a set of individual data points that are close to each other and that are treated as a single unit in the later offline macro-clustering.
[Figure: view of micro-clusters and view of macro-clusters]
Micro-Clusters
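What to Store in a Micro-Cluster
CFT = (CF2^x, CF1^x, CF2^t, CF1^t, n)
(the per-dimension sum of squares and sum of the data values, the sum of squares and sum of the time stamps, and the number of points)
Key idea: Additivity Property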
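The additivity property can be illustrated with a minimal sketch. It assumes the cluster feature vector from the paper (per-dimension sum of squares and sum of the data values, sum of squares and sum of the time stamps, and a point count); the class name `MicroCluster` and its methods are hypothetical names chosen for illustration, not the authors' code.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class MicroCluster:
    """Hypothetical container for a cluster feature vector."""
    cf2x: Tuple[float, ...]  # per-dimension sum of squared data values
    cf1x: Tuple[float, ...]  # per-dimension sum of data values
    cf2t: float              # sum of squared time stamps
    cf1t: float              # sum of time stamps
    n: int                   # number of points

    @staticmethod
    def from_point(x: Sequence[float], t: float) -> "MicroCluster":
        # A single point is a micro-cluster with n = 1.
        return MicroCluster(
            tuple(float(v) * float(v) for v in x),
            tuple(float(v) for v in x),
            float(t) * float(t),
            float(t),
            1,
        )

    def __add__(self, other: "MicroCluster") -> "MicroCluster":
        # Additivity: the vector of a union is the component-wise sum.
        return MicroCluster(
            tuple(a + b for a, b in zip(self.cf2x, other.cf2x)),
            tuple(a + b for a, b in zip(self.cf1x, other.cf1x)),
            self.cf2t + other.cf2t,
            self.cf1t + other.cf1t,
            self.n + other.n,
        )

    def centroid(self) -> Tuple[float, ...]:
        return tuple(s / self.n for s in self.cf1x)


a = MicroCluster.from_point([1.0, 2.0], t=1)
b = MicroCluster.from_point([3.0, 4.0], t=2)
merged = a + b  # same result as building the vector from both points at once
```

Because every component is a plain sum, adding a point and merging two clusters are the same operation, which is what keeps the online update cheap.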
Pyramidal Time Frame
The snapshots follow a pyramidal pattern.
When should we take a snapshot?
The micro-clusters are stored at snapshots.
[Figure: snapshots]
Pyramidal Time Frame
Snapshots are classified into different orders, which can range up to log_α(T). Snapshots of order i are taken at intervals of α^i. For example, if T = 55 and α = 2, then we have order 0 with interval 2^0 = 1, order 1 with interval 2^1 = 2, order 2 with interval 2^2 = 4, order 3 with interval 2^3 = 8, order 4 with interval 2^4 = 16, and order 5 with interval 2^5 = 32.
For a data stream, the maximum number of snapshots maintained at T time units since the beginning of the stream mining process is (α + 1) · log_α(T), because at most α + 1 snapshots are kept for each order.
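The pyramidal storage rule can be sketched as a small simulation. It assumes, as in the paper, that a snapshot of order i is taken whenever the clock is divisible by α^i and that only the last α + 1 snapshots of each order are kept; the function name `pyramidal_snapshots` is a hypothetical choice.

```python
import math
from collections import deque


def pyramidal_snapshots(T, alpha=2):
    """Return {order: clock values still stored} after T ticks (newest first)."""
    assert alpha >= 2, "this sketch assumes alpha >= 2"
    stored = {}
    for t in range(1, T + 1):
        order = 0
        # t is a snapshot of every order i for which alpha**i divides t.
        while t % (alpha ** order) == 0:
            # deque(maxlen=...) silently drops the oldest entry when full.
            stored.setdefault(order, deque(maxlen=alpha + 1)).append(t)
            order += 1
    return {i: sorted(q, reverse=True) for i, q in stored.items()}


snaps = pyramidal_snapshots(55, alpha=2)
total = sum(len(v) for v in snaps.values())
bound = (2 + 1) * math.log(55, 2)
```

For T = 55 and α = 2 this keeps clock values 55, 54, 53 at order 0 and only 32 at order 5, and the total of 16 stored entries stays under the (α + 1) · log_α(T) bound of roughly 17.3. The same clock value can appear under several orders, so the number of distinct snapshots is smaller still.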
Why Pyramidal Pattern?
For any user-specified time window of h, at least one stored snapshot can be found within 2·h units of the current time.
Please Note: Only Approximate Answers!!!
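A minimal sketch of the lookup behind this guarantee: given the clock values of the stored snapshots, pick the latest one at or before T - h. The stored values below are the distinct snapshots that the T = 55, α = 2 example would retain; the helper name `snapshot_before` is hypothetical.

```python
from bisect import bisect_right


def snapshot_before(stored_times, target):
    """Latest stored snapshot time that is <= target, or None if there is none."""
    times = sorted(stored_times)
    idx = bisect_right(times, target)
    return times[idx - 1] if idx > 0 else None


# Distinct snapshot times kept at T = 55 with alpha = 2 (illustrative input).
stored = [16, 32, 40, 44, 48, 50, 52, 53, 54, 55]
T = 55

found = snapshot_before(stored, T - 10)   # horizon h = 10, so the target is 45
gap = T - found                           # how far back the snapshot really is
```

With h = 10 the lookup returns the snapshot at time 44, which is 11 units back rather than exactly 10. That gap is the approximation the slide warns about, and it stays within 2·h.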
Micro Cluster Creation
It is assumed that a total of q micro-clusters are maintained at any moment by the algorithm.
The initial q micro-clusters are created using an offline process (k-means) at the very beginning of the data stream computation process.
Online Micro Cluster Maintenance
How to deal with a newly arriving point?
1. Join one of the existing clusters
2. Create a new cluster of its own
How to deal with the old clusters?
1. Delete them (based on relevance stamp)
2. Merge them (merge the closest two)
A merged cluster keeps all the IDs of its components
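The maintenance choices above can be sketched roughly as follows. This is a simplified illustration, not the paper's exact procedure: the relevance stamp is approximated by the mean time stamp of a cluster, the boundary of a cluster is a fixed factor times its RMS radius with a floor `min_radius`, and the names `MC`, `process_point`, `boundary_factor`, `min_radius` and `stale_before` are all hypothetical. It assumes q >= 2.

```python
import math
from itertools import combinations


class MC:
    """Hypothetical minimal micro-cluster keeping only the sums used here."""

    def __init__(self, x, t, cid):
        self.ids = [cid]
        self.n = 1
        self.cf1x = [float(v) for v in x]
        self.cf2x = [float(v) ** 2 for v in x]
        self.cf1t = float(t)

    def centroid(self):
        return [s / self.n for s in self.cf1x]

    def rms_radius(self):
        var = sum(s2 / self.n - (s1 / self.n) ** 2
                  for s1, s2 in zip(self.cf1x, self.cf2x))
        return math.sqrt(max(var, 0.0))

    def absorb(self, x, t):
        self.merge(MC(x, t, None), keep_ids=False)

    def merge(self, other, keep_ids=True):
        if keep_ids:
            self.ids += other.ids  # a merged cluster keeps all component IDs
        self.n += other.n
        self.cf1x = [a + b for a, b in zip(self.cf1x, other.cf1x)]
        self.cf2x = [a + b for a, b in zip(self.cf2x, other.cf2x)]
        self.cf1t += other.cf1t


def process_point(clusters, x, t, next_id, q,
                  boundary_factor=2.0, min_radius=1.0, stale_before=0.0):
    """Handle one arriving point; returns the next unused cluster ID."""
    if clusters:
        nearest = min(clusters, key=lambda c: math.dist(c.centroid(), x))
        limit = boundary_factor * max(nearest.rms_radius(), min_radius)
        if math.dist(nearest.centroid(), x) <= limit:
            nearest.absorb(x, t)          # 1. join an existing cluster
            return next_id
    if len(clusters) >= q:                # make room before creating a cluster
        oldest = min(clusters, key=lambda c: c.cf1t / c.n)
        if oldest.cf1t / oldest.n < stale_before:
            clusters.remove(oldest)       # delete a stale cluster
        else:
            a, b = min(combinations(clusters, 2),
                       key=lambda p: math.dist(p[0].centroid(), p[1].centroid()))
            a.merge(b)                    # merge the closest two
            clusters.remove(b)
    clusters.append(MC(x, t, next_id))    # 2. create a new cluster of its own
    return next_id + 1


clusters, nid = [], 0
nid = process_point(clusters, [0, 0], 1, nid, q=2)
nid = process_point(clusters, [0.5, 0], 2, nid, q=2)    # absorbed by cluster 0
nid = process_point(clusters, [10, 10], 3, nid, q=2)    # new cluster 1
nid = process_point(clusters, [20, 20], 4, nid, q=2)    # forces a merge of 0 and 1
ids_after_merge = [list(c.ids) for c in clusters]
nid = process_point(clusters, [30, 30], 5, nid, q=2, stale_before=4.0)  # forces a delete
ids_after_delete = [list(c.ids) for c in clusters]
```

The ID list carried by a merged cluster is what later lets the offline phase subtract older snapshots from newer ones.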
Macro-Cluster Creation
Based on the Additivity Property of the cluster feature vector
Macro-Cluster Creation
The current time is T and the window size is h, meaning the user wants to find the clusters formed in (T-h, T).
Approach:
1. Find the snapshot for T and get the micro-cluster set S(T).
2. Find the snapshot for T-h and get the micro-cluster set S(T-h).
3. Use S(T) - S(T-h).
Specifically, suppose we have a merged cluster with ID list (C1, C2, C3) in S(T) and a cluster with ID C1 in S(T-h). Then we use CFT(C1,C2,C3) - CFT(C1) = CFT(C2,C3), because C1 was formed before T-h and thus should not contribute to the micro-clusters formed in (T-h, T).
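The subtraction step can be sketched as follows. For brevity each micro-cluster is reduced to a point count and a sum vector, keyed by its set of IDs; the dictionary layout and the function name `subtract_snapshots` are assumptions made for illustration.

```python
def subtract_snapshots(current, past):
    """Subtract S(T-h) from S(T).

    Both arguments map a frozenset of cluster IDs to (n, per-dimension sums).
    Returns a list of (remaining_ids, n, sums) for the window (T-h, T).
    """
    result = []
    for ids, (n, sums) in current.items():
        remaining = set(ids)
        for old_ids, (old_n, old_sums) in past.items():
            if old_ids <= ids:  # the older cluster is a component of this one
                n -= old_n
                sums = tuple(a - b for a, b in zip(sums, old_sums))
                remaining -= old_ids
        if n > 0:
            # If every ID already existed, keep the original label.
            result.append((frozenset(remaining) or ids, n, sums))
    return result


past = {frozenset({"C1"}): (4, (8.0,))}                     # S(T-h)
current = {frozenset({"C1", "C2", "C3"}): (10, (50.0,))}    # S(T)
window = subtract_snapshots(current, past)
```

Here the merged cluster (C1, C2, C3) loses the 4 points that C1 already held at T-h, leaving 6 points attributed to (C2, C3).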
Example
Snapshot at time T-h: C_ID: [C1]
Snapshot at time T: C_ID: [C1, C2, C3]
Result for the window (T-h, T): C_ID: [C2, C3]
Macro-Cluster Creation
Run K-means on Micro-Clusters
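One way to read this step is a weighted k-means over the micro-cluster centroids, where each centroid counts as many times as the micro-cluster has points. The sketch below assumes fixed seeds and a fixed iteration count for simplicity (the paper samples seeds in proportion to cluster size); the function name `weighted_kmeans` is hypothetical.

```python
import math


def weighted_kmeans(points, weights, seeds, iterations=10):
    """Cluster pseudo-points (micro-cluster centroids) weighted by point count."""
    centers = [tuple(float(v) for v in s) for s in seeds]
    k, dim = len(centers), len(centers[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            # Assign the pseudo-point to its nearest current center.
            j = min(range(k), key=lambda i: math.dist(p, centers[i]))
            totals[j] += w
            for d, v in enumerate(p):
                sums[j][d] += w * v
        # New center = weighted mean of the assigned pseudo-points.
        centers = [
            tuple(s / totals[j] for s in sums[j]) if totals[j] > 0 else centers[j]
            for j in range(k)
        ]
    return centers


centroids = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
counts = [1, 3, 2, 2]   # number of points in each micro-cluster
macro = weighted_kmeans(centroids, counts, seeds=[(0.0, 0.0), (10.0, 10.0)])
```

The first macro-cluster center lands at (0.75, 0.0) rather than (0.5, 0.0) because the micro-cluster at (1, 0) carries three times the weight of the one at the origin.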
How do you feel about this paper?
My feeling:
Quite Fuzzy Results:
Approximation is everywhere.
Nothing New:
Micro-Clusters, K-means, Cluster Feature Vectors, and the Pyramidal Time Frame are all old ideas.
Counter Example
C_ID: [C2]
C_ID: [C1, C2, C3]
Time: T
C_ID: [C1, C3]
Time: T-h
Result
Advertisement
Di and Charu’s project deals with:
1. Deterministic Clusters
2. Clusters with Arbitrary Shapes
3. Real Expirations
4. Disk Version
5. Outlier Detection for Free