Scalable Distributed Real-Time Clustering for Big Data Streams
European Masters in Distributed Computing (EMDC)
Student: Antonio Severien
Supervisors: Albert Bifet (Yahoo! Research), Gianmarco De Francisci Morales (Yahoo! Research), Marta Arias (Universitat Politècnica de Catalunya)
Contributions
¤ SAMOA (Scalable Advanced Massive Online Analysis)
  ¤ Stream Processing Engine (SPE) abstraction framework
  ¤ Machine learning libraries adapter layer
  ¤ API for implementing data flow topologies
¤ SAMOA Clustering Algorithm
  ¤ Distributed stream clustering algorithm based on CluStream*
  ¤ Parallelizes the clustering task and scales up with the resources available
27/06/13
(*) “A Framework for Clustering Evolving Data Streams”, Aggarwal et al. 2003
Motivation
¤ How big is "big" in Big Data?
  ¤ 2.5 quintillion bytes are generated every day
  ¤ 90% of today's data was generated in the last 2 years
  ¤ Sources: sensors, social networks, e-business, mobile, internet logs, etc.
¤ Problems: the 3 Vs
  ¤ Volume: storing everything is not viable
  ¤ Velocity: the production rate keeps increasing
  ¤ Variety: different sources, different data, different types
Where is the Big Data?
¤ Where is the food?
  ¤ Databases?
  ¤ Data warehouses?
  ¤ Distributed databases?
  ¤ Distributed file systems?
¤ It’s flowing online! It’s Streaming!
Crunching Big Data
¤ Map and Reduce
  ¤ MapReduce/GFS
  ¤ Hadoop/HDFS
¤ Stream Processing Engines (SPE)
  ¤ Apache S4
  ¤ Twitter Storm
Distributed Systems
¤ Actors Model
  ¤ Independent concurrent processes
  ¤ Communicate asynchronously by message passing
¤ MapReduce Model
  ¤ Mappers: filtering and sorting
  ¤ Reducers: summarization and aggregation
  ¤ Large volumes of distributed data
  ¤ Iterative: map-reduce-map-reduce…
Streaming
¤ Streaming Model
  ¤ One-pass processing: discard each item after use
  ¤ Low memory usage: store statistics and summaries
  ¤ Unbounded flow of data
  ¤ Evolving data sets
  ¤ Limited processing time
  ¤ Arrival order is not guaranteed
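As an illustration of one-pass processing with low memory usage, the following minimal Java sketch keeps only a running count, mean and variance of a numeric stream and discards each item after folding it in. The class name `RunningStats` and its methods are hypothetical and not part of SAMOA or MOA; the update rule is Welford's online algorithm, chosen here as one common way to maintain such a summary.

```java
/** One-pass summary of a numeric stream: constant memory, each item seen once. */
final class RunningStats {
    private long count;   // number of items observed so far
    private double mean;  // running mean of the observed items
    private double m2;    // running sum of squared deviations from the mean

    /** Folds one item into the summary; the item itself is not stored. */
    void update(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    long count() {
        return count;
    }

    double mean() {
        return mean;
    }

    /** Population variance of the items seen so far (0 for an empty stream). */
    double variance() {
        return count == 0 ? 0.0 : m2 / count;
    }
}
```

Memory use stays constant no matter how many items arrive, which is the property the streaming model relies on.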
Making sense
¤ Machine Learning & Data Mining
  ¤ Make sense of data, extract patterns and react accordingly
  ¤ Train machines to “think”
  ¤ Perceive behavior
  ¤ Find relations between similar information
¤ Unsupervised Learning
  ¤ Clustering algorithms
Machine Learning Tools
¤ Mahout
  ¤ Machine learning framework used on top of Hadoop/HDFS
  ¤ Batch processing with the MapReduce model
  ¤ Open source, with good community support
¤ Massive Online Analysis (MOA)
  ¤ Stream machine learning tool
  ¤ Many algorithms implemented; based on WEKA
  ¤ Constrained to a single machine
¤ Jubatus
  ¤ Distributed streaming machine learning framework
  ¤ No clustering algorithms yet
  ¤ No stream platform abstraction
Scalable Advanced Massive Online Analysis (SAMOA)
¤ Distributed data streaming machine learning framework
  ¤ Stream Processing Engine (SPE) abstraction
  ¤ Code once, run everywhere
  ¤ Focus on distributed algorithm design
  ¤ Fault tolerance, communication, consistency and availability are provided by the underlying distributed processing platform
¤ Initial release provides integration with:
  ¤ Apache S4
  ¤ Twitter Storm
Scalable Advanced Massive Online Analysis (SAMOA)
[Figure: SAMOA architecture. Blocks shown: SAMOA Algorithms & SAMOA-API; SPE Adapter (S4, Storm, other SPE); ML Adapter (MOA, other ML libraries).]
Apache S4
¤ Distributed, partially fault-tolerant stream processing platform
¤ Based on the Actors model and inspired by the MapReduce model
¤ Flexible data flow: any topology and processing unit can be built, not only the mapper and reducer design
¤ Specialized in processing events from a stream and emitting events into a stream
Scalable Advanced Massive Online Analysis (SAMOA)
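[Figure: mapping of a SAMOA topology onto an S4 app. SAMOA side: a Task with a stream source, an entrance processing item (EPI), processing items (PI) and a stream. S4 side: an S4 App with a stream source, processing elements (PE) and streams.]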
How to use?
¤ Adding an SPE using the API
  ¤ S4ProcessingItem: processing element wrapper
  ¤ S4Stream: wrapper for an S4 stream
  ¤ S4ComponentFactory: provides components specific to Apache S4, such as processing elements and streams
  ¤ S4TopologyBuilder: creates the topology instances
¤ Adding an algorithm and building a topology

  class SimpleTask {
    ...
    // Builder that creates and wires the topology components
    TopologyBuilder topologyBuilder = new TopologyBuilder();
    // Entrance PI: feeds events from the source into the topology
    EntranceProcessingItem entranceProcessingItem =
        topologyBuilder.createEntrancePI(new SourceProcessor());
    // Stream carrying the events emitted by the entrance PI
    Stream stream = topologyBuilder.createStream(entranceProcessingItem);
    // PI that consumes the stream through a key-based connection
    ProcessingItem processingItem = topologyBuilder.createPI(new Processor());
    processingItem.connectInputKey(stream);
    ...
Grouping the Best of All
¤ Flexible programming model
¤ Distributed stream processing engine abstraction
¤ Integrated machine learning and data mining algorithms
¤ Easy API to implement new algorithms and SPE adapters
SAMOA Clustering Algorithm
¤ Distributed stream clustering algorithm
¤ Validates the SAMOA implementation and its integration with Apache S4 using the SAMOA-S4 adapter
¤ Deploy on Apache S4
Stream Clustering Algorithm
¤ CluStream Framework
  ¤ Based on k-means
  ¤ Online phase (micro-clustering)
  ¤ Offline phase (macro-clustering)
¤ k-means: partitions a set of data into k distinct clusters according to a similarity function
  ¤ Minimizes an objective function based on the squared Euclidean distance
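The equation shown on this slide did not survive extraction. As a sketch, the usual form of the k-means objective is given here; the symbols are assumptions rather than the slide's own notation, with C_j the j-th cluster, \mu_j its centroid and x_i a data point:

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

k-means alternates between assigning each point to its nearest centroid and recomputing the centroids, and neither step can increase J.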
K-means Clustering Algorithm
¤ Advantages
  ¤ Simple, fast and efficient
¤ Known issues with k-means
  ¤ Sensitive to initial seeding
  ¤ The minimization problem is NP-hard even for simple configurations
    ¤ e.g. two clusters (k = 2), or points in the plane
  ¤ Global optimum not guaranteed
  ¤ Good for spherical clusters, not good for arbitrary shapes
Distributed Stream Clustering
¤ Online micro-clustering
  ¤ Applied in the local clustering phase
  ¤ Cluster Feature Vectors with Timestamp (CFT)
    ¤ N: number of data objects
    ¤ LS: linear sum of data objects
    ¤ SS: sum of squares of data objects
    ¤ LST: sum of timestamps
    ¤ SST: sum of squares of timestamps
¤ Offline macro-clustering
  ¤ Uses micro-clusters as weighted pseudo-points
  ¤ Applied in the global clustering phase with a weighted k-means
    ¤ Uses probabilistic seeding that depends on the weights of the micro-clusters
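The five CFT components are all sums, so a micro-cluster can absorb a point or merge with another micro-cluster by simple addition. The following Java sketch illustrates that idea under stated assumptions: the class `MicroCluster` and its method names are hypothetical and are not the SAMOA or MOA classes, and LS and SS are kept per dimension.

```java
/** Illustrative micro-cluster summary (CFT); not the SAMOA or MOA implementation. */
final class MicroCluster {
    private long n;             // N: number of data objects
    private final double[] ls;  // LS: per-dimension linear sum
    private final double[] ss;  // SS: per-dimension sum of squares
    private double lst;         // LST: sum of timestamps
    private double sst;         // SST: sum of squared timestamps

    MicroCluster(int dimensions) {
        ls = new double[dimensions];
        ss = new double[dimensions];
    }

    /** Absorbs one point; the point can be discarded afterwards. */
    void insert(double[] point, long timestamp) {
        n++;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += point[d];
            ss[d] += point[d] * point[d];
        }
        lst += timestamp;
        sst += (double) timestamp * timestamp;
    }

    /** Merges another summary of the same dimension; every component is additive. */
    void merge(MicroCluster other) {
        n += other.n;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += other.ls[d];
            ss[d] += other.ss[d];
        }
        lst += other.lst;
        sst += other.sst;
    }

    /** Weight of this micro-cluster when it is used as a pseudo-point. */
    long weight() {
        return n;
    }

    /** Centroid LS / N; requires at least one inserted point. */
    double[] centroid() {
        double[] c = new double[ls.length];
        for (int d = 0; d < c.length; d++) {
            c[d] = ls[d] / n;
        }
        return c;
    }

    /** Mean timestamp LST / N; requires at least one inserted point. */
    double meanTimestamp() {
        return lst / n;
    }
}
```

In the offline phase, `centroid()` and `weight()` would supply the weighted pseudo-points that the weighted k-means operates on.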
CluStream Snapshot
[Figure: CluStream snapshot showing micro-clusters, macro-clusters and ground truth.]
Scalable Advanced Massive Online Analysis (SAMOA)
[Figure: SAMOA Clustering Task topology. Labels shown: stream source, Distribution PI, Local Clustering PI and Global Clustering PI (clustering); Sampling PI and Evaluator PI (evaluation); two outputs.]
Experiments, Evaluation & Results
¤ Experimental Setup
  ¤ Four 2.4 GHz Intel Xeon dual quad-core, 48 GB RAM
  ¤ Process parallelism level: 1, 8 & 16
  ¤ Instance dimensions: 3 & 15
  ¤ Source dataset: random events generator
  ¤ Noise: 0% & 10%
  ¤ Cluster movement speed: 0.1 units every 500 or every 12,000 instances
¤ Evaluations
  ¤ Scalability: measure throughput when adding concurrent processes
  ¤ Clustering quality: measure whether the clustering algorithm is accurate
Scalability
[Figure: Baseline Comparison. x-axis: Evaluation Step; y-axis: Throughput (instances/second).]
Scalability
[Figure: Average Throughput with Dimensions 3 and 15. x-axis: Process Parallelism; y-axis: Average Throughput (instances/second).]
Scalability
[Figure: Parallelism Throughput with Dimension 3. x-axis: Process Parallelism; y-axis: Avg. Cumulative Throughput (instances/sec).]
Clustering Quality Metrics
¤ Internal & external evaluations
  ¤ Internal evaluation uses attributes available from the clustering structure
  ¤ External evaluation uses external validation structures
    ¤ e.g. the ground truth provided by the source generator
¤ Metrics
  ¤ Cohesion coefficient (SSE): measures the within-cluster sum of squared errors
  ¤ Separation coefficient (BSS): measures the between-cluster sum of squares
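The slides do not spell out how the two metrics are computed. The Java sketch below uses the common textbook definitions as an assumption: SSE sums the squared distances of points to their own cluster centroid, and BSS sums, per cluster, the cluster size times the squared distance from its centroid to the global mean. The class `ClusterQuality` and its methods are hypothetical names, and every cluster is assumed to be non-empty.

```java
/** Illustrative cohesion (SSE) and separation (BSS) for a hard clustering. */
final class ClusterQuality {
    private ClusterQuality() {}

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Mean of a non-empty set of points. */
    static double[] centroid(double[][] points) {
        double[] c = new double[points[0].length];
        for (double[] p : points) {
            for (int d = 0; d < c.length; d++) {
                c[d] += p[d];
            }
        }
        for (int d = 0; d < c.length; d++) {
            c[d] /= points.length;
        }
        return c;
    }

    /** Cohesion: squared distances of all points to their own cluster centroid. */
    static double sse(double[][][] clusters) {
        double total = 0.0;
        for (double[][] cluster : clusters) {
            double[] c = centroid(cluster);
            for (double[] p : cluster) {
                total += squaredDistance(p, c);
            }
        }
        return total;
    }

    /** Separation: cluster size times squared distance of its centroid to the global mean. */
    static double bss(double[][][] clusters) {
        int dims = clusters[0][0].length;
        double[] global = new double[dims];
        long count = 0;
        for (double[][] cluster : clusters) {
            for (double[] p : cluster) {
                for (int d = 0; d < dims; d++) {
                    global[d] += p[d];
                }
                count++;
            }
        }
        for (int d = 0; d < dims; d++) {
            global[d] /= count;
        }
        double total = 0.0;
        for (double[][] cluster : clusters) {
            total += cluster.length * squaredDistance(centroid(cluster), global);
        }
        return total;
    }
}
```

Under these definitions SSE + BSS equals the total sum of squares around the global mean, so for a fixed data set a lower SSE goes together with a higher BSS.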
Clustering Quality 0% Noise
[Figure: clustering snapshots at 25,000 and 45,000 instances.]
Clustering Quality 0% Noise
Ratio = BSS / GT
Clustering Quality 10% Noise
[Figure: clustering snapshots at 25,000 and 45,000 instances, annotated "Good clustering" and "Poor clustering".]
Clustering Quality 10% Noise
Conclusion
¤ There is important information in the massive amount of data being produced and discarded
¤ There is a need for tools to deal with this efficiently
¤ Efforts have been made to crunch big data
¤ Interpreting and retrieving relevant information is where machine learning and data mining operate
¤ Real-time analysis responds faster to evolving data
¤ SAMOA abstracts the platform and maintains the algorithms, which makes them easy to implement, test and use
Acknowledgements
¤ Thanks to Erasmus Mundus and all three universities (UPC, KTH and IST) for providing this opportunity
¤ Thanks to all the EMDC students
¤ Thanks to Yahoo! Research for the great project