Stream Clustering
CSE 902
Big Data

Stream analysis
  Stream: continuous flow of data
  Challenges
    Volume: not possible to store all the data
    One-time access: not possible to process the data using multiple passes
    Real-time analysis: certain applications need real-time analysis of the data
    Temporal locality: data evolves over time, so the model should be adaptive

Stream Clustering
  [Figure labels: Topic cluster, Article, Listings]

Stream Clustering
  Online phase: summarize the data into memory-efficient data structures
  Offline phase: use a clustering algorithm to find the data partition

Stream Clustering Algorithms
  Data structure       Examples
  Prototypes           Stream, Stream LSearch
  CF-Trees             Scalable k-means, single pass k-means
  Microcluster Trees   ClusTree, DenStream, HP-Stream
  Grids                D-Stream, ODAC
  Coreset Tree         StreamKM++
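
The online/offline split described above can be sketched as follows. This is only an illustration under stated assumptions, not any of the listed algorithms: the class `OnlineSummary`, the `radius` threshold, the function `offline_kmeans`, and the naive initialization from the first k prototypes are all hypothetical choices, and the sketch places no cap on the number of prototypes.

```python
def dist_sq(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class OnlineSummary:
    """Online phase: compress the stream into weighted prototypes (one pass)."""

    def __init__(self, radius):
        self.radius_sq = radius * radius  # assumed absorption threshold
        self.prototypes = []              # each entry is [center, weight]

    def update(self, point):
        best = min(self.prototypes,
                   key=lambda p: dist_sq(p[0], point), default=None)
        if best is None or dist_sq(best[0], point) > self.radius_sq:
            # Too far from every prototype: start a new one.
            self.prototypes.append([list(point), 1])
            return
        # Absorb the point: running-mean update of the nearest prototype.
        best[1] += 1
        best[0] = [c + (x - c) / best[1] for c, x in zip(best[0], point)]


def offline_kmeans(prototypes, k, iterations=10):
    """Offline phase: weighted k-means over the prototypes, not the raw stream.

    Assumes k <= len(prototypes); centers start at the first k prototypes.
    """
    centers = [list(c) for c, _ in prototypes[:k]]
    dim = len(centers[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in range(k)]
        weights = [0.0] * k
        for center, w in prototypes:
            j = min(range(k), key=lambda i: dist_sq(centers[i], center))
            weights[j] += w
            for d, x in enumerate(center):
                sums[j][d] += w * x
        # Keep the old center if no prototype was assigned to it.
        centers = [[s / weights[j] for s in sums[j]] if weights[j] > 0
                   else centers[j] for j in range(k)]
    return centers
```

The raw points are seen once by `update` and then dropped; only the prototypes reach the offline step, which can therefore be rerun whenever a partition is needed.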

Prototypes
  Stream, LSearch

CF-Trees
  Summarize the data in each CF-vector:
    Linear sum of data points
    Squared sum of data points
    Number of points
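
A CF-vector with the three components listed above can be sketched as below. The class name `CFVector` and its methods are hypothetical, and storing the squared sum as a single scalar (the sum of squared norms) is an assumption; a per-dimension squared sum is an equally common choice.

```python
from math import sqrt


class CFVector:
    """Clustering feature: number of points, linear sum, squared sum."""

    def __init__(self, dim):
        self.n = 0                       # number of points
        self.linear_sum = [0.0] * dim    # per-dimension sum of the points
        self.squared_sum = 0.0           # sum of squared norms (scalar)

    def add(self, point):
        """Absorb one point; the point itself need not be stored."""
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
        self.squared_sum += sum(x * x for x in point)

    def merge(self, other):
        """CF-vectors are additive: merging is component-wise addition."""
        self.n += other.n
        for i, s in enumerate(other.linear_sum):
            self.linear_sum[i] += s
        self.squared_sum += other.squared_sum

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        """Root-mean-square distance of the summarized points to the centroid."""
        c = self.centroid()
        variance = self.squared_sum / self.n - sum(v * v for v in c)
        return sqrt(max(variance, 0.0))  # clamp tiny negative rounding error
```

Because the three components are plain sums, the centroid and radius of a cluster can be recovered at any time without revisiting the data.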
  CF-Tree examples: Scalable k-means, single pass k-means

Microclusters
  CF-Trees with time element
  CluStream
    Linear sum and squared sum of timestamps
    Delete old microclusters, or merge microclusters if their timestamps are close to each other
  Sliding window clustering
    Timestamp of the most recent data point added to the vector
    Maintain only the most recent T microclusters
  DenStream
    Microclusters are associated with weights based on recency
    Outliers detected by creating separate microclusters
  ClusTree
    Allows real-time clustering
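
The idea of weighting microclusters by recency can be sketched as follows. This is a simplified illustration, not the DenStream algorithm itself: the class `DecayedMicroCluster`, the parameter `decay_rate`, and the exponential fading factor 2^(-decay_rate * elapsed) are assumptions chosen for the example.

```python
class DecayedMicroCluster:
    """Microcluster whose weight and linear sum fade as time passes."""

    def __init__(self, dim, decay_rate, created_at=0.0):
        self.decay_rate = decay_rate      # assumed fading parameter
        self.weight = 0.0                 # decayed count of absorbed points
        self.linear_sum = [0.0] * dim     # decayed linear sum of points
        self.last_update = created_at

    def _fade(self, now):
        """Scale the stored sums down for the time elapsed since last update."""
        factor = 2.0 ** (-self.decay_rate * (now - self.last_update))
        self.weight *= factor
        self.linear_sum = [s * factor for s in self.linear_sum]
        self.last_update = now

    def add(self, point, now):
        """Fade the old content, then absorb the new point with weight 1."""
        self._fade(now)
        self.weight += 1.0
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]

    def weight_at(self, now):
        """Weight the microcluster would have at time `now` with no new points."""
        return self.weight * 2.0 ** (-self.decay_rate * (now - self.last_update))

    def center(self):
        return [s / self.weight for s in self.linear_sum]
```

With such a weight, a microcluster that stops receiving points drifts toward zero and can be dropped once `weight_at` falls below a chosen threshold, while recent points dominate the center.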

Grids
  D-Stream
    Assign the data to grids
    Grids weighted by recency of points added to them
    Each grid associated with a label
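
A minimal sketch of a recency-weighted grid in the spirit of the D-Stream points above. The class `DecayedGrid`, the uniform `cell_size`, the `decay` factor, and the single density `threshold` are hypothetical simplifications; the cluster labels that D-Stream attaches to grids are not modeled here.

```python
from math import floor


class DecayedGrid:
    """Maps points to grid cells and keeps a time-decayed density per cell."""

    def __init__(self, cell_size, decay=0.998):
        self.cell_size = cell_size
        self.decay = decay    # assumed per-time-unit decay factor in (0, 1)
        self.cells = {}       # cell index -> (density, time of last update)

    def cell_of(self, point):
        """Index of the cell containing the point."""
        return tuple(floor(x / self.cell_size) for x in point)

    def density(self, cell, now):
        """Density of a cell at time `now`, decayed since its last update."""
        value, updated = self.cells.get(cell, (0.0, now))
        return value * self.decay ** (now - updated)

    def add(self, point, now):
        """Decay the cell's density to `now`, then add 1 for the new point."""
        cell = self.cell_of(point)
        self.cells[cell] = (self.density(cell, now) + 1.0, now)
        return cell

    def dense_cells(self, now, threshold):
        """Cells whose decayed density is at least `threshold`."""
        return [c for c in self.cells if self.density(c, now) >= threshold]
```

Only the occupied cells are stored, so memory depends on the number of non-empty cells rather than on the number of points seen.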
  DGClust
    Distributed clustering of sensor data
    Sensors maintain local copies of the grid and communicate updates to the grid to a central site

StreamKM++ (Coresets)
  StreamKM++: A Clustering Algorithm for Data Streams, Ackermann, Journal of Experimental Algorithmics 2012

Kernel-based Clustering
Kernel-based Stream Clustering
  Use non-linear distance measures to define similarity between data points in the stream
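
As a sketch of what a non-linear distance measure involves: the squared distance between a point and a cluster center in the kernel-induced feature space can be written purely in terms of kernel evaluations, without ever forming the center explicitly. The function names, the RBF kernel, and the `gamma` parameter below are assumptions for illustration.

```python
from math import exp


def linear_kernel(x, y):
    """Plain dot product; with it the result is the usual Euclidean distance."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel, one common non-linear choice."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kernel_distance_sq(x, cluster, kernel=rbf_kernel):
    """Squared feature-space distance from x to the mean of `cluster`.

    ||phi(x) - mu||^2 = k(x, x) - (2/m) * sum_j k(x, p_j)
                        + (1/m^2) * sum_j sum_l k(p_j, p_l)
    """
    m = len(cluster)
    self_term = kernel(x, x)
    cross_term = sum(kernel(x, p) for p in cluster) / m
    # The within-cluster term needs every pair of cluster members.
    within_term = sum(kernel(p, q) for p in cluster for q in cluster) / (m * m)
    return self_term - 2.0 * cross_term + within_term
```

With `linear_kernel` the function reduces to the ordinary squared distance to the centroid, which gives a quick sanity check; with a non-linear kernel every cluster member has to be kept, because the center exists only implicitly.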
  Challenges
    Quadratic running time complexity
    Computationally expensive to compute centers using linear sums and squared sums (the CF-vector approach will not work)

Stream Kernel k-means (sKKM)
  Kernel k-means, weighted kernel k-means
  History from only the preceding data chunk retained
  Approximation of Kernel k-Means for Streaming Data, Havens, ICPR 2012

Statistical Leverage Scores
  Measures the influence of a point in the low-rank approximation
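
One standard way to compute statistical leverage scores for a kernel matrix is sketched below, assuming NumPy is available. The rank-r leverage score of point i is the squared norm of row i of the matrix holding the top r eigenvectors. The function names, the RBF kernel, and the `gamma` and `rank` parameters are assumptions for illustration, not details taken from the slides.

```python
import numpy as np


def rbf_kernel_matrix(X, gamma=1.0):
    """Pairwise RBF kernel matrix for the rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    dist_sq = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(dist_sq, 0.0))  # clamp rounding error


def leverage_scores(K, rank):
    """Rank-`rank` leverage scores of a symmetric kernel matrix K."""
    # eigh returns eigenvalues in ascending order, so the top eigenvectors
    # are the last `rank` columns.
    _, eigvecs = np.linalg.eigh(K)
    top = eigvecs[:, -rank:]
    return np.sum(top ** 2, axis=1)


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
scores = leverage_scores(rbf_kernel_matrix(X, gamma=0.5), rank=2)
```

Each score lies between 0 and 1 and the scores sum to the chosen rank, so they can be normalized into a sampling distribution over the points.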

Approximate Stream Kernel k-means
  Uses statistical leverage scores to determine which data points in the stream are potentially important
  Retain the important points and discard the rest
  Use an approximate version of kernel k-means to obtain the clusters
  Linear time complexity
  Bounded amount of memory
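
The retain-or-discard step above can be sketched as follows. The slides do not state the exact sampling rule, so everything here is an assumption: the function name `retain_important`, the `budget` parameter, keeping each point with probability proportional to its score, and the hard cap that falls back to the highest-scoring points.

```python
import random


def retain_important(points, scores, budget, rng=None):
    """Keep at most `budget` points, favoring those with high scores.

    Each point is kept with probability min(1, budget * score / total),
    so the expected number kept is at most `budget`.
    """
    rng = rng or random.Random()
    total = float(sum(scores))
    if total <= 0.0 or budget <= 0:
        return []
    kept = []
    for point, score in zip(points, scores):
        keep_prob = min(1.0, budget * score / total)
        if rng.random() < keep_prob:
            kept.append((score, point))
    if len(kept) > budget:
        # Hard cap on memory: fall back to the highest-scoring points.
        # Note that this branch does not preserve stream order.
        kept.sort(key=lambda pair: pair[0], reverse=True)
        kept = kept[:budget]
    return [point for _, point in kept]
```

Points with a score of zero are never kept, and the cap guarantees that the retained set never exceeds the budget regardless of how the random draws fall.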
  [Figure labels: Importance Sampling, Clustering, Kernel k-means, Approximate Kernel k-means]
  Updating eigenvectors
    Only the eigenvectors and eigenvalues of the kernel matrix are required for both sampling and clustering
    Update the eigenvectors and eigenvalues incrementally

Network Traffic Monitoring
  Clustering used to detect intrusions in the network
  Network Intrusion data set
    TCP dump data from seven weeks of LAN traffic
    10 classes: 9 types of intrusions, 1 class of legitimate traffic

  Algorithm                           Running time (ms per data point)   Cluster accuracy (NMI)
  Approximate stream kernel k-means   6.6                                14.2
  StreamKM++                          0.8                                7.0
  sKKM                                42.1                               13.3

  Around 200 points clustered per second

Summary
  Efficient kernel-based stream clustering algorithm: linear running time complexity
  Memory required is bounded
  Real-time clustering is possible
  Limitation: does not account for data evolution