Stream Clustering


CSE 902, Big Data

Stream analysis

Stream: Continuous flow of data

Challenges
Volume: Not possible to store all the data
One-time access: Not possible to process the data using multiple passes
Real-time analysis: Certain applications need real-time analysis of the data
Temporal locality: Data evolves over time, so the model should be adaptive

Stream Clustering

[Figure: stream clustering example; labels: Topic cluster, Article, Listings]

Stream Clustering

Online Phase
Summarize the data into memory-efficient data structures

Offline Phase
Use a clustering algorithm to find the data partition

Stream Clustering Algorithms

Data Structures      Examples
Prototypes           Stream, Stream LSearch
CF-Trees             Scalable k-means, Single pass k-means
Microcluster Trees   ClusTree, DenStream, HP-Stream
Grids                D-Stream, ODAC
Coreset Tree         StreamKM++

Prototypes
Stream, LSearch

CF-Trees
Summarize the data in each CF-vector:
Linear sum of data points
Squared sum of data points
Number of points
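The CF-vector described above can be sketched in a few lines. This is a minimal illustration, not the implementation used by any of the listed algorithms: the class name CFVector, its method names, and the choice of a single scalar for the squared sum are assumptions made for the example.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class CFVector:
    """Clustering feature: number of points, linear sum, squared sum."""
    dim: int
    n: int = 0                                # number of points absorbed
    ls: list = field(default_factory=list)    # per-dimension linear sum
    ss: float = 0.0                           # sum of squared norms (assumed scalar)

    def __post_init__(self):
        if not self.ls:
            self.ls = [0.0] * self.dim

    def add(self, x):
        """Absorb one point; the point itself is not stored."""
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
        self.ss += sum(v * v for v in x)

    def merge(self, other):
        """CF-vectors are additive, so two summaries merge by summing."""
        self.n += other.n
        for i, v in enumerate(other.ls):
            self.ls[i] += v
        self.ss += other.ss

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        """Root-mean-square distance of the absorbed points to the centroid."""
        c2 = sum(c * c for c in self.centroid())
        return sqrt(max(self.ss / self.n - c2, 0.0))


cf = CFVector(dim=2)
cf.add((0.0, 0.0))
cf.add((2.0, 0.0))
print(cf.centroid(), cf.radius())
```

Because the three statistics are sums, the centroid and radius are recoverable at any time without revisiting the stream, which is what makes the summary suitable for one-pass processing.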

CF-Tree examples: Scalable k-means, Single pass k-means

Microclusters
CF-Trees with time element

CluStream
Linear sum and squared sum of timestamps
Delete old microclusters, or merge microclusters if their timestamps are close to each other

Sliding Window Clustering
Timestamp of the most recent data point added to the vector
Maintain only the most recent T microclusters

DenStream
Microclusters are associated with weights based on recency
Outliers detected by creating separate microclusters
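Recency-based weighting can be sketched as a microcluster whose statistics fade exponentially between updates. This is an illustration under stated assumptions, not DenStream itself: the class name DecayedMicroCluster, the parameter decay_rate, and the fading factor 2^(-decay_rate * elapsed time) are choices made for the example, and timestamps are assumed to be non-decreasing.

```python
class DecayedMicroCluster:
    """Microcluster whose weight and sums fade as time passes."""

    def __init__(self, dim, decay_rate=0.25):
        self.decay_rate = decay_rate   # hypothetical fading parameter
        self.weight = 0.0              # faded count of absorbed points
        self.ls = [0.0] * dim          # faded linear sum
        self.ss = 0.0                  # faded squared sum
        self.last_update = 0.0         # timestamp of the most recent update

    def _fade(self, now):
        # Scale all statistics by the fading factor for the elapsed time.
        factor = 2.0 ** (-self.decay_rate * (now - self.last_update))
        self.weight *= factor
        self.ls = [v * factor for v in self.ls]
        self.ss *= factor
        self.last_update = now

    def add(self, x, now):
        """Fade the old statistics, then absorb the new point with weight 1."""
        self._fade(now)
        self.weight += 1.0
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def weight_at(self, now):
        """Weight the microcluster would have at time `now` with no new points."""
        return self.weight * 2.0 ** (-self.decay_rate * (now - self.last_update))

    def center(self):
        return [v / self.weight for v in self.ls]


mc = DecayedMicroCluster(dim=1, decay_rate=1.0)
mc.add([2.0], now=0)
mc.add([4.0], now=1)
print(mc.center(), mc.weight_at(2))
```

Recent points dominate the center, and a microcluster that stops receiving points loses weight over time; a weight threshold (not shown) could then separate outlier microclusters from established ones.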


ClusTree
Allows real-time clustering

Grids

D-Stream
Assign the data to grids
Grids weighted by recency of the points added to them
Each grid associated with a label
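The grid idea above can be sketched as a dictionary of cells with decayed densities. This is a simplified illustration rather than the D-Stream algorithm: the class name DecayedGrid, the parameters cell_size, decay, and dense_threshold, and the two-way dense/sparse labeling are assumptions made for the example.

```python
from math import floor


class DecayedGrid:
    """Uniform grid whose cell densities decay between updates."""

    def __init__(self, cell_size, decay=0.9, dense_threshold=3.0):
        self.cell_size = cell_size              # side length of a cell (assumed uniform)
        self.decay = decay                      # hypothetical decay factor in (0, 1)
        self.dense_threshold = dense_threshold  # hypothetical labeling threshold
        self.cells = {}                         # cell index -> (density, last update time)

    def cell_of(self, x):
        """Map a point to the integer index of the cell that contains it."""
        return tuple(floor(v / self.cell_size) for v in x)

    def add(self, x, now):
        """Decay the cell's density to time `now`, then count the new point."""
        key = self.cell_of(x)
        density, last = self.cells.get(key, (0.0, now))
        self.cells[key] = (density * self.decay ** (now - last) + 1.0, now)
        return key

    def density(self, key, now):
        density, last = self.cells.get(key, (0.0, now))
        return density * self.decay ** (now - last)

    def label(self, key, now):
        """Label a cell by comparing its decayed density to the threshold."""
        return "dense" if self.density(key, now) >= self.dense_threshold else "sparse"


grid = DecayedGrid(cell_size=1.0, decay=0.5, dense_threshold=1.5)
key = grid.add((0.2, 0.7), now=0)
grid.add((0.9, 0.1), now=1)
print(key, grid.label(key, now=1), grid.label(key, now=2))
```

Only non-empty cells are stored, so memory depends on the number of occupied cells rather than on the number of points seen.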

DGClust
Distributed clustering of sensor data
Sensors maintain local copies of the grid and communicate updates to the grid to a central site

StreamKM++ (Coresets)

StreamKM++: A Clustering Algorithm for Data Streams, Ackermann et al., Journal of Experimental Algorithmics 2012

Kernel-based Clustering

Kernel-based Stream Clustering
Use non-linear distance measures to define similarity between data points in the stream
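For reference, a kernel-induced distance between a point and a cluster center can be written as follows. The notation is an assumption introduced here, not taken from the slides: \kappa is the kernel function, \phi its implicit feature map, C_k the set of points in cluster k, and c_k the cluster center in feature space.

```latex
\[
c_k = \frac{1}{|C_k|} \sum_{x_j \in C_k} \phi(x_j),
\qquad
\lVert \phi(x) - c_k \rVert^2
  = \kappa(x, x)
  - \frac{2}{|C_k|} \sum_{x_j \in C_k} \kappa(x, x_j)
  + \frac{1}{|C_k|^2} \sum_{x_j \in C_k} \sum_{x_l \in C_k} \kappa(x_j, x_l)
\]
```

The center is never formed explicitly. It is available only through kernel evaluations against the members of the cluster, so the cost grows with the number of pairs of points.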

Challenges
Quadratic running time complexity
Computationally expensive to compute centers using linear sums and squared sums (the CF-vector approach will not work)

Stream Kernel k-means (sKKM)
[Figure labels: Kernel k-means, Weighted Kernel k-means]
History from only the preceding data chunk retained
Approximation of Kernel k-Means for Streaming Data, Havens, ICPR 2012

Statistical Leverage Scores
Measure the influence of a point in the low-rank approximation
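One standard way to define statistical leverage scores, given here as a sketch whose notation is an assumption rather than the slides' own: let K be the n-by-n kernel matrix, k the target rank, and U_k the n-by-k matrix whose orthonormal columns are the top k eigenvectors of K.

```latex
\[
K \approx U_k \Lambda_k U_k^{\top},
\qquad
\ell_i = \bigl\lVert (U_k)_{i,:} \bigr\rVert_2^2 = \sum_{j=1}^{k} (U_k)_{ij}^2,
\qquad
0 \le \ell_i \le 1,
\quad
\sum_{i=1}^{n} \ell_i = k
\]
```

A point with a large \ell_i contributes strongly to the top-k eigenspace, so dropping it would change the low-rank approximation more than dropping a point with a small score.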


Approximate Stream kernel k-means
Uses statistical leverage scores to determine which data points in the stream are potentially important
Retain the important points and discard the rest
Use an approximate version of kernel k-means to obtain the clusters
Linear time complexity
Bounded amount of memory

Approximate Stream kernel k-means
[Figure: pipeline with Importance Sampling and Clustering stages; labels: Kernel k-means, Approximate Kernel k-means]

Updating eigenvectors
Only the eigenvectors and eigenvalues of the kernel matrix are required for both sampling and clustering
Update the eigenvectors and eigenvalues incrementally

Approximate Stream Kernel k-means

Network Traffic Monitoring
Clustering used to detect intrusions in the network

Network Intrusion Data set
TCP dump data from seven weeks of LAN traffic
10 classes: 9 types of intrusions, 1 class of legitimate traffic

Algorithm                           Running time in ms (per data point)   Cluster accuracy (NMI)
Approximate stream kernel k-means   6.6                                   14.2
StreamKM++                          0.8                                   7.0
sKKM                                42.1                                  13.3

Around 200 points clustered per second

Summary
Efficient kernel-based stream clustering algorithm: linear running time complexity

Memory required is bounded

Real-time clustering is possible

Limitation: does not account for data evolution