Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun...
Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB
Beyond Scaling: Real-Time Event Processing with Stream Mining
Mikio L. Braun, @mikiobraun
http://blog.mikiobraun.de
TWIMPACT
http://twimpact.com
Event Data
Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie
Finance, Gaming, Monitoring, Social Media, Sensor Networks, Advertisement
Event Data is Huge: Volume
● The problem: You easily get A LOT OF DATA!
– 100 events per second
– 360k events per hour
– 8.6M events per day
– 260M events per month
– 3.2B events per year
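The figures above follow from plain multiplication. A minimal sketch, assuming a constant rate, a 30-day month and a 365-day year (the month and year lengths are assumptions, not stated on the slide):

```python
# Event volume at a constant rate. The 100 events/s figure is from the slide;
# the 30-day month and 365-day year are assumptions made for this sketch.
RATE_PER_SECOND = 100
HOUR = 3600
DAY = 24 * HOUR


def events_in(seconds, rate=RATE_PER_SECOND):
    """Number of events produced in `seconds` at `rate` events per second."""
    return rate * seconds


volumes = {
    "hour": events_in(HOUR),
    "day": events_in(DAY),
    "month": events_in(30 * DAY),
    "year": events_in(365 * DAY),
}

for period, count in volumes.items():
    print(f"{period:>5}: {count:,}")
```

Rounded, these are the 360k, 8.6M, 260M and 3.2B values listed on the slide.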
Event Data is Huge: Diversity
● Potentially large spaces:
– distinct words: >100k
– IP addresses: >100M
– users in a social network: >10M
http://www.flickr.com/photos/arenamontanus/269158554/
http://wordle.net
To Scale Or Not To Scale?
Stream Mining to the rescue
● Stream mining algorithms:
– answer “stream queries” with finite resources
● Typical examples:
– how often does an item appear in a stream?
– how many distinct elements are in the stream?
– what are the top-k most frequent items?
[Diagram: Continuous Stream of Data → Bounded Resource Analyzer → Stream Queries]
The Trade-Off
[Diagram: trade-off between Exact, Fast, and Big Data, locating Stream Mining and “Map Reduce and friends”]
First seen here: http://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandra
Heavy Hitters (a.k.a. Top-k)
● Count activities over large item sets (millions, even more, e.g. IP addresses, Twitter users)
● Interested in most active elements only.
Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
Fixed table of counts:
item   count
132    15
142    12
432    8
553    5
712    3
023    2
Case 1: element already in database (142): increment its count, 12 → 13
Case 2: new element (713): replaces the entry with the smallest count (023, count 2) and gets count 3
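The two cases above can be sketched in a few lines. This is a simplified illustration in the spirit of the cited Space-Saving approach, not the paper's data structure: the class name `SpaceSaving` and its methods are made up here, and the linear scan for the smallest entry stands in for a more efficient structure.

```python
class SpaceSaving:
    """Fixed-size table of counts for heavy hitters (illustrative sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def update(self, item, inc=1):
        if item in self.counts:
            # Case 1: the item is already tracked, so just increment it.
            self.counts[item] += inc
        elif len(self.counts) < self.capacity:
            # Table not full yet: start tracking the item.
            self.counts[item] = inc
        else:
            # Case 2: evict the entry with the smallest count. The newcomer
            # inherits that count plus the increment, so its count may be
            # an overestimate.
            victim = min(self.counts, key=self.counts.get)
            smallest = self.counts.pop(victim)
            self.counts[item] = smallest + inc

    def top(self, k):
        """Return the k entries with the largest counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


# Reproduce the table from the slide, then apply both cases.
table = SpaceSaving(capacity=6)
table.counts = {"132": 15, "142": 12, "432": 8, "553": 5, "712": 3, "023": 2}
table.update("142")  # case 1: 12 -> 13
table.update("713")  # case 2: replaces 023 (count 2) and gets count 3
print(table.top(3))
```

Memory stays bounded by `capacity` regardless of how many distinct items appear in the stream, which is the point of the fixed table.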
Heavy Hitters over Time-Window
[Diagram labels: Time, DB]
● Keep quite a big log (a month?)
● Constant write/erase in database
● Alternative: Exponential decay
Exponential Decay
● Instead of a fixed window, use exponential decay
● The beauty: updates are recursive
[Formula not preserved; its annotations were: halftime, score, timestamp, time shift term]
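The slide's formula did not survive extraction. One common form that matches its annotations is a half-life decay; the symbols below ($s$ for the score, $t_i$ for event timestamps, $c_i$ for event counts, $h$ for the halftime) are assumptions chosen for this sketch, not necessarily the notation used in the talk.

```latex
% Assumed half-life form: each event's contribution halves every h time units.
s(t) = \sum_{i \,:\, t_i \le t} c_i \, 2^{-(t - t_i)/h}

% Recursive update when a new event with count c arrives at time t' >= t:
% the old score is shifted in time by one factor, then the new count is added.
s(t') = c + 2^{-(t' - t)/h} \, s(t)
```

The second line follows from the first because $2^{-(t' - t_i)/h} = 2^{-(t' - t)/h} \cdot 2^{-(t - t_i)/h}$, so the whole history can be aged with a single multiplication.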
Exponential Decay
● Collect stats by a table of expdecay counters
  counters[item]  # counters
  ts[item]        # last timestamp
● update(C, item, timestamp, count) – update counts
  C.counters[item] = count + weight(timestamp, C.ts[item]) * C.counters[item]
  C.ts[item] = timestamp
  C.lastupdate = timestamp
● score(C, item) – return score
  return weight(C.lastupdate, C.ts[item]) * C.counters[item]
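A runnable version of the pseudocode above might look as follows. The slide does not define `weight`, so the half-life form `2 ** (-(t_new - t_old) / halftime)` is an assumption, as are the class name and the handling of items that have not been seen yet. Timestamps are assumed to be non-decreasing.

```python
class ExpDecayCounters:
    """Table of exponentially decaying counters (illustrative sketch)."""

    def __init__(self, halftime):
        self.halftime = halftime
        self.counters = {}     # item -> decayed count at its last update
        self.ts = {}           # item -> timestamp of its last update
        self.lastupdate = 0.0  # timestamp of the most recent update overall

    def weight(self, t_new, t_old):
        # Assumed decay: a count halves every `halftime` time units.
        return 2.0 ** (-(t_new - t_old) / self.halftime)

    def update(self, item, timestamp, count=1.0):
        old = self.counters.get(item, 0.0)
        old_ts = self.ts.get(item, timestamp)
        self.counters[item] = count + self.weight(timestamp, old_ts) * old
        self.ts[item] = timestamp
        self.lastupdate = timestamp

    def score(self, item):
        if item not in self.counters:
            return 0.0
        # Age the stored count up to the time of the latest update.
        return self.weight(self.lastupdate, self.ts[item]) * self.counters[item]


c = ExpDecayCounters(halftime=10.0)
c.update("a", 0.0)
c.update("a", 10.0)  # old count halves: 1 + 0.5 * 1 = 1.5
c.update("b", 20.0)  # moves lastupdate to 20, one more halftime for "a"
print(c.score("a"), c.score("b"))
```

Because each counter stores its own timestamp, an item only needs to be touched when it occurs; aging is applied lazily in `score`.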
Count-Min Sketches
● Summarize histograms over large feature sets
● Like Bloom filters, but storing counts instead of bits
● Query: Take minimum over all hash functions
[Figure: count-min sketch table, one row per hash function (n different hash functions) and m bins per row]
0 0 3 0 1 1 0 2
0 2 0 0 0 3 5 2
0 5 3 2 2 4 5 0
1 3 7 3 0 2 0 8
[Figure annotations: Updates for new entry; Query result: 1]
G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithms 55(1): 58-75 (2005).
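A minimal count-min sketch, to make the update and query rules concrete. The class and parameter names are made up for this sketch, and salting SHA-256 with the row index is only a convenient stand-in for n independent hash functions.

```python
import hashlib


class CountMinSketch:
    """n hash functions (rows) times m bins (columns) of counters."""

    def __init__(self, n_hashes=4, m_bins=8):
        self.n = n_hashes
        self.m = m_bins
        self.table = [[0] * m_bins for _ in range(n_hashes)]

    def _bin(self, row, item):
        # Assumption: one salted hash per row approximates n different hashes.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little") % self.m

    def update(self, item, count=1):
        # Update: add the count to one bin in every row.
        for row in range(self.n):
            self.table[row][self._bin(row, item)] += count

    def query(self, item):
        # Query: take the minimum over all hash functions. Collisions can
        # only inflate a bin, so the result never underestimates.
        return min(self.table[row][self._bin(row, item)] for row in range(self.n))


sketch = CountMinSketch(n_hashes=4, m_bins=64)
for word in ["a", "b", "a", "c", "a"]:
    sketch.update(word)
print(sketch.query("a"))  # at least 3; exact unless every row collides
```

The table size is fixed up front, so memory does not grow with the number of distinct items; more bins and more rows reduce the overestimation error.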
Wait a minute? Only Counting?
● Well, getting the top most active items is already useful.
– Web analytics, Users, Trending Topics
● Counting is statistics!
Counting is Statistics
● Empirical mean:
● Correlations:
● Principal Component Analysis:
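The slide's formulas were not preserved, but the point can be illustrated with textbook definitions: the empirical mean and the Pearson correlation are functions of a handful of running sums. The class below is a sketch with hypothetical names; PCA is not shown, though a covariance matrix built from the same kind of sums would be its input.

```python
import math


class RunningStats:
    """Mean and correlation of (x, y) pairs from running sums only."""

    def __init__(self):
        self.n = 0
        self.sx = 0.0   # sum of x
        self.sy = 0.0   # sum of y
        self.sxx = 0.0  # sum of x * x
        self.syy = 0.0  # sum of y * y
        self.sxy = 0.0  # sum of x * y

    def add(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.syy += y * y
        self.sxy += x * y

    def mean(self):
        return self.sx / self.n, self.sy / self.n

    def correlation(self):
        mx, my = self.mean()
        cov = self.sxy / self.n - mx * my
        var_x = self.sxx / self.n - mx * mx
        var_y = self.syy / self.n - my * my
        return cov / math.sqrt(var_x * var_y)


stats = RunningStats()
for x, y in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]:
    stats.add(x, y)
print(stats.mean(), stats.correlation())
```

Each data point is seen once and then discarded; only six numbers are kept. Replacing the plain sums with decaying counters would weight recent data more heavily.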
Example 1: Least Squares Regression
[Formula not preserved; annotations: a d × d matrix “with entries” (lost), “this could be huge!”, “d x d is probably ok”]
Idea: Batch method like least squares on recent portion of the data.
But: It's just a sum!
Least Squares Regression
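● Need to compute [formula not preserved]
● For each [lost] do
– [update step not preserved]
– [update step not preserved]
● Then, reconstruct
– [formula not preserved]
As a reminder: [formula not preserved]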
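Since the formulas are missing, the following sketch assumes the standard normal equations, w = (XᵀX)⁻¹ Xᵀy, where both XᵀX (d × d) and Xᵀy (length d) are sums over data points and can therefore be accumulated one point at a time. The class name, the `solve` helper and the toy data are all hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


class StreamingLeastSquares:
    """Accumulates X^T X and X^T y; memory is O(d^2), independent of n."""

    def __init__(self, d):
        self.xtx = [[0.0] * d for _ in range(d)]
        self.xty = [0.0] * d

    def add(self, x, y):
        for i, xi in enumerate(x):
            self.xty[i] += xi * y
            for j, xj in enumerate(x):
                self.xtx[i][j] += xi * xj

    def weights(self):
        # Raises ZeroDivisionError if X^T X is singular.
        return solve(self.xtx, self.xty)


model = StreamingLeastSquares(d=2)
for x1 in [0.0, 1.0, 2.0, 3.0]:
    model.add([1.0, x1], 1.0 + 2.0 * x1)  # noise-free data from y = 1 + 2 * x1
w = model.weights()
print(w)
```

To focus on recent data, as the slide suggests, the entries of both accumulators could be kept as decaying counters instead of plain sums.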
Example 2: Maximum-Likelihood
● Estimate probabilistic models
● But wait, how do I compute “1/n” with randomly spaced events?
[Formulas not preserved; the slide annotates the chosen estimate as “slightly biased, but simpler”]
Example 3: Outlier detection
● Once you have a model, you can compute p-values (based on recent time frames!)
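As an illustration, assume the model is a single Gaussian whose mean and variance come from running sums; the slide does not name a model, so this choice and all names below are assumptions. A two-sided p-value then follows from the complementary error function.

```python
import math


class GaussianOutlierDetector:
    """Gaussian model fitted from running sums (illustrative sketch)."""

    def __init__(self):
        self.n = 0
        self.s = 0.0   # sum of values
        self.ss = 0.0  # sum of squared values

    def add(self, x):
        self.n += 1
        self.s += x
        self.ss += x * x

    def p_value(self, x):
        """Two-sided p-value of x under the fitted Gaussian."""
        mean = self.s / self.n
        var = self.ss / self.n - mean * mean
        z = abs(x - mean) / math.sqrt(var)
        return math.erfc(z / math.sqrt(2.0))


det = GaussianOutlierDetector()
for v in [9.0, 10.0, 11.0, 10.0, 9.0, 11.0]:
    det.add(v)
print(det.p_value(10.0), det.p_value(15.0))
```

A small p-value flags a value as unusual. To base the model on recent time frames, the plain sums could be swapped for decaying counters.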
Example 4: Clustering
● Online clustering
– For each data point:
  ● Map to closest centroid (⇒ compute distances)
  ● Update centroid
– count-min sketches to represent sum over all vectors in a class
Aggarwal, A Framework for Clustering Massive-Domain Data Streams, IEEE International Conference on Data Engineering , 2009
[Figure: count-min sketch table, 4 rows of 8 bins]
0 0 3 0 1 1 0 2
0 2 0 0 0 3 5 2
0 5 3 2 2 4 5 0
1 3 7 3 0 2 0 8
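The online clustering loop can be sketched with plain dense vectors. This is a simplification: it keeps each centroid as a running mean in a Python list rather than in a count-min sketch as in the cited paper, and the names and toy data are made up.

```python
def closest(centroids, point):
    """Index of the centroid with the smallest squared distance to point."""
    def dist2(c):
        return sum((ci - pi) ** 2 for ci, pi in zip(c, point))
    return min(range(len(centroids)), key=lambda k: dist2(centroids[k]))


class OnlineKMeans:
    """Assign each point to its closest centroid, then move that centroid."""

    def __init__(self, initial_centroids):
        self.centroids = [list(c) for c in initial_centroids]
        self.counts = [0] * len(self.centroids)

    def add(self, point):
        k = closest(self.centroids, point)
        self.counts[k] += 1
        # Running mean: move the centroid 1/count of the way to the point.
        rate = 1.0 / self.counts[k]
        self.centroids[k] = [
            c + rate * (p - c) for c, p in zip(self.centroids[k], point)
        ]
        return k


km = OnlineKMeans([[0.0, 0.0], [10.0, 10.0]])
points = [[1.0, 1.0], [9.0, 9.0], [3.0, 1.0], [11.0, 9.0]]
labels = [km.add(p) for p in points]
print(labels, km.centroids)
```

For very high-dimensional sparse data, the dense centroid lists become the bottleneck, which is where a sketch-based representation of the per-class sums comes in.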
Example 5: TF-IDF
● estimate word-document frequencies
● for each word: update(word, t, 1.0)
● for each document: update(“#docs”, t, 1.0)
● query: score(word) / score(“#docs”)
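The three steps above map directly onto the decaying counters from earlier. The sketch below is self-contained and makes two assumptions: a half-life decay for the counters, and counting each word once per document (the slide does not say whether repeated words count more than once).

```python
class DecayCounters:
    """Compact exponentially decaying counters (half-life form assumed)."""

    def __init__(self, halftime):
        self.halftime = halftime
        self.counters = {}
        self.ts = {}
        self.lastupdate = 0.0

    def _weight(self, t_new, t_old):
        return 2.0 ** (-(t_new - t_old) / self.halftime)

    def update(self, item, t, count):
        old = self.counters.get(item, 0.0)
        old_t = self.ts.get(item, t)
        self.counters[item] = count + self._weight(t, old_t) * old
        self.ts[item] = t
        self.lastupdate = t

    def score(self, item):
        if item not in self.counters:
            return 0.0
        return self._weight(self.lastupdate, self.ts[item]) * self.counters[item]


def observe_document(c, words, t):
    # Assumption: each distinct word counts once per document.
    for word in set(words):
        c.update(word, t, 1.0)
    c.update("#docs", t, 1.0)


def doc_frequency(c, word):
    """Decayed fraction of documents containing the word."""
    return c.score(word) / c.score("#docs")


c = DecayCounters(halftime=3600.0)
observe_document(c, ["stream", "mining", "stream"], t=0.0)
observe_document(c, ["stream", "data"], t=0.0)
print(doc_frequency(c, "stream"), doc_frequency(c, "mining"))
```

Both documents share a timestamp here, so no decay is applied and the ratios are plain fractions. With spread-out timestamps, older documents contribute less to both numerator and denominator.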
Example 6: Classification with Naive Bayes
● Naive Bayes is also just counting, right?
[Formula annotations for multinomial Naive Bayes: class priors; priors; total number of words in class; number of times word appears in class; frequency of word in document]
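The quantities named in the annotations are all counts, so a multinomial Naive Bayes classifier can be trained one document at a time. The sketch below uses add-one (Laplace) smoothing, which is an assumption since the slide's formula is missing; class and method names are hypothetical.

```python
import math
from collections import defaultdict


class StreamingNaiveBayes:
    """Multinomial Naive Bayes maintained purely by counting."""

    def __init__(self):
        self.doc_counts = defaultdict(int)   # documents per class
        self.word_counts = defaultdict(lambda: defaultdict(int))  # per class
        self.total_words = defaultdict(int)  # total number of words in class
        self.vocab = set()

    def update(self, label, words):
        self.doc_counts[label] += 1
        for w in words:
            self.word_counts[label][w] += 1
            self.total_words[label] += 1
            self.vocab.add(w)

    def predict(self, words):
        n_docs = sum(self.doc_counts.values())
        best, best_lp = None, -math.inf
        for label, n in self.doc_counts.items():
            lp = math.log(n / n_docs)  # class prior
            # Assumed add-one smoothing over the vocabulary seen so far.
            denom = self.total_words[label] + len(self.vocab)
            for w in words:
                lp += math.log((self.word_counts[label].get(w, 0) + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best


nb = StreamingNaiveBayes()
nb.update("spam", ["buy", "cheap", "pills"])
nb.update("ham", ["meeting", "at", "noon"])
nb.update("spam", ["cheap", "offer"])
print(nb.predict(["cheap", "pills"]), nb.predict(["meeting"]))
```

Every counter here could be replaced by a decaying counter or a count-min sketch entry without changing the prediction logic.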
Example 6: Classification with Naive Bayes
[Figure: paper from ICML 2003]
Example 6: Classification with Naive Bayes
● 7 Steps to improve NB:
– transform TF to log( . + 1)
– IDF-style normalization
– square length normalization
– use complement probability
– another log
– normalize those weights again
– Predict linearly using those weights
What about non-parametric methods and Kernel Methods?
● Problem here: no real accumulation of information in statistics, e.g. SVMs
[Formula annotation: sum over all elements!]
● Could still use streamdrill to extract a representative subset.
Streamdrill
● Heavy Hitters counting + exponential decay
● Instant counts & top-k results over time windows.
● Indices!
● Snapshots for historical analysis
● Beta demo available at http://streamdrill.com, launch imminent
Architecture Overview
Example: Twitter Stock Analysis
http://play.streamdrill.com/vis/
Summary
● Doesn't always have to be scaling!
● Stream mining: Approximate results with finite resources.
● streamdrill: stream analysis engine