Big Data Infrastructure - GitHub Pages · 2016-03-31
Big Data Infrastructure
Week 12: Real-Time Data Analytics (2/2)
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
CS 489/698 Big Data Infrastructure (Winter 2016)
Jimmy Lin
David R. Cheriton School of Computer Science
University of Waterloo
March 31, 2016
These slides are available at http://lintool.github.io/bigdata-2016w/
Twitter’s data warehousing architecture
What’s the issue?
Hashing for Three Common Tasks
• Cardinality estimation
  – What’s the cardinality of set S?
  – How many unique visitors to this page?
  – Exact: HashSet; Approximate: HLL counter
• Set membership
  – Is x a member of set S?
  – Has this user seen this ad before?
  – Exact: HashSet; Approximate: Bloom filter
• Frequency estimation
  – How many times have we observed x?
  – How many queries has this user issued?
  – Exact: HashMap; Approximate: CMS
HyperLogLog Counter
• Task: cardinality estimation of a set
  – size() → number of unique elements in the set
• Observation: hash each item and examine the hash code
  – On expectation, 1/2 of the hash codes will start with 1
  – On expectation, 1/4 of the hash codes will start with 01
  – On expectation, 1/8 of the hash codes will start with 001
  – On expectation, 1/16 of the hash codes will start with 0001
  – …

How do we take advantage of this observation?
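One way to see the answer is a crude single-register sketch: track the maximum "rank" (position of the first 1-bit) over all hash codes and estimate the cardinality as a power of two. This is an illustration only; real HyperLogLog keeps many registers and combines them with a harmonic mean to cut the variance. Object and method names here are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

// Crude single-register sketch of the HyperLogLog idea (illustration only).
object LeadingZeroEstimator {
  // Rank = position of the first 1-bit in the 32-bit hash (1-based).
  private def rank(hash: Int): Int = Integer.numberOfLeadingZeros(hash) + 1

  // Estimate |distinct items| as 2^R, where R is the max rank observed:
  // seeing a hash with R leading positions before its first 1-bit is
  // expected only once in ~2^R distinct hashes.
  def estimate(items: Iterator[String]): Long = {
    var maxRank = 0
    for (x <- items) {
      val h = MurmurHash3.stringHash(x)
      maxRank = math.max(maxRank, rank(h))
    }
    1L << maxRank
  }
}
```

Because the estimate depends only on the set of distinct hashes, duplicates cannot change it, which is exactly the property a cardinality estimator needs.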
Bloom Filters
• Task: keep track of set membership
  – put(x) → insert x into the set
  – contains(x) → yes if x is a member of the set
• Components
  – m-bit bit vector
  – k hash functions: h1 … hk
0 0 0 0 0 0 0 0 0 0 0 0
Bloom Filters: put
0 0 0 0 0 0 0 0 0 0 0 0
put(x): h1(x) = 2, h2(x) = 5, h3(x) = 11
Bloom Filters: put
0 1 0 0 1 0 0 0 0 0 1 0
put(x)
Bloom Filters: contains
0 1 0 0 1 0 0 0 0 0 1 0
contains(x): h1(x) = 2, h2(x) = 5, h3(x) = 11
Bloom Filters: contains
0 1 0 0 1 0 0 0 0 0 1 0
contains(x): h1(x) = 2, h2(x) = 5, h3(x) = 11
A[h1(x)] AND A[h2(x)] AND A[h3(x)] = YES
Bloom Filters: contains
0 1 0 0 1 0 0 0 0 0 1 0
contains(y): h1(y) = 2, h2(y) = 6, h3(y) = 9
Bloom Filters: contains
0 1 0 0 1 0 0 0 0 0 1 0
contains(y): h1(y) = 2, h2(y) = 6, h3(y) = 9
A[h1(y)] AND A[h2(y)] AND A[h3(y)] = NO
What’s going on here?
Bloom Filters
• Error properties of contains(x):
  – False positives possible
  – No false negatives
• Usage:
  – Constraints: capacity, error probability
  – Tunable parameters: size of bit vector m, number of hash functions k
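The bit-vector-plus-k-hashes design above fits in a few lines of Scala. This is a sketch only; deriving the k hash functions from two Murmur hashes ("double hashing") is an assumption for illustration, not part of the slides.

```scala
import scala.util.hashing.MurmurHash3

// Minimal Bloom filter: an m-bit vector and k hash functions.
class BloomFilter(m: Int, k: Int) {
  private val bits = new java.util.BitSet(m)

  // Hypothetical double-hashing scheme: h_i(x) = (h1 + i*h2) mod m.
  private def indexes(x: String): Seq[Int] = {
    val h1 = MurmurHash3.stringHash(x, 0)
    val h2 = MurmurHash3.stringHash(x, 1)
    (0 until k).map(i => ((h1 + i * h2) % m + m) % m)
  }

  def put(x: String): Unit = indexes(x).foreach(i => bits.set(i))

  // "true" may be a false positive; "false" is always correct,
  // because put(x) would have set every one of x's bits.
  def contains(x: String): Boolean = indexes(x).forall(i => bits.get(i))
}
```

The no-false-negatives guarantee is visible directly in the code: every bit that put(x) sets is exactly the set of bits contains(x) checks.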
Count-Min Sketches
• Task: frequency estimation
  – put(x) → increment count of x by one
  – get(x) → returns the frequency of x
• Components
  – k hash functions: h1 … hk
  – m by k array of counters (k rows, each with m counters)
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
(m columns × k rows)
Count-Min Sketches: put
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
put(x): h1(x) = 2, h2(x) = 5, h3(x) = 11, h4(x) = 4
Count-Min Sketches: put
0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 1 0 0 0 0 0 0 0 0
put(x)
Count-Min Sketches: put
0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 1 0 0 0 0 0 0 0 0
put(x): h1(x) = 2, h2(x) = 5, h3(x) = 11, h4(x) = 4
Count-Min Sketches: put
0 2 0 0 0 0 0 0 0 0 0 0
0 0 0 0 2 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 0
0 0 0 2 0 0 0 0 0 0 0 0
put(x)
Count-Min Sketches: put
0 2 0 0 0 0 0 0 0 0 0 0
0 0 0 0 2 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 0
0 0 0 2 0 0 0 0 0 0 0 0
put(y): h1(y) = 6, h2(y) = 5, h3(y) = 12, h4(y) = 2
Count-Min Sketches: put
0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0
put(y)
Count-Min Sketches: get
0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0
get(x): h1(x) = 2, h2(x) = 5, h3(x) = 11, h4(x) = 4
Count-Min Sketches: get
0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0
get(x): h1(x) = 2, h2(x) = 5, h3(x) = 11, h4(x) = 4
MIN(A[h1(x)], A[h2(x)], A[h3(x)], A[h4(x)]) = 2
Count-Min Sketches: get
0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0
get(y): h1(y) = 6, h2(y) = 5, h3(y) = 12, h4(y) = 2
Count-Min Sketches: get
0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0
get(y): h1(y) = 6, h2(y) = 5, h3(y) = 12, h4(y) = 2
MIN(A[h1(y)], A[h2(y)], A[h3(y)], A[h4(y)]) = 1
Count-Min Sketches
• Error properties:
  – Reasonable estimation of heavy-hitters
  – Frequent over-estimation of the tail
• Usage:
  – Constraints: number of distinct events, distribution of events, error bounds
  – Tunable parameters: number of counters m, number of hash functions k, size of counters
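The put/get walkthrough above corresponds to a few lines of Scala: put increments one counter per row, get takes the minimum across rows. A sketch only; the hash scheme (Murmur with per-row seeds) is an assumption for illustration.

```scala
import scala.util.hashing.MurmurHash3

// Count-min sketch: k rows of m counters.
class CountMinSketch(m: Int, k: Int) {
  private val counts = Array.ofDim[Long](k, m)

  // Column for item x in row `row` (row index doubles as the hash seed).
  private def col(row: Int, x: String): Int = {
    val h = MurmurHash3.stringHash(x, row)
    ((h % m) + m) % m
  }

  // Increment one counter in every row.
  def put(x: String): Unit =
    for (r <- 0 until k) counts(r)(col(r, x)) += 1

  // Min across rows: never under-estimates, since every counter touched
  // by x was incremented on each put(x); collisions only inflate it.
  def get(x: String): Long =
    (0 until k).map(r => counts(r)(col(r, x))).min
}
```

The "frequent over-estimation of the tail" property follows directly: a rare item shares counters with whatever collides into them, and MIN can only filter out some of that noise.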
Three Common Tasks
• Cardinality estimation
  – What’s the cardinality of set S?
  – How many unique visitors to this page?
  – Exact: HashSet; Approximate: HLL counter
• Set membership
  – Is x a member of set S?
  – Has this user seen this ad before?
  – Exact: HashSet; Approximate: Bloom filter
• Frequency estimation
  – How many times have we observed x?
  – How many queries has this user issued?
  – Exact: HashMap; Approximate: CMS
Source: Wikipedia (River)
Stream Processing Architectures
Producer/Consumers
Producer Consumer
How do consumers get data from producers?
Producer/Consumers
Producer Consumer
Producer pushes (e.g., callback)
Producer/Consumers
Producer Consumer
Consumer pulls (e.g., poll, tail)
Producer/Consumers
Producer Consumer
Consumer
Consumer
Consumer
Producer
Producer/Consumers
Producer Consumer
Consumer
Consumer
Consumer
Producer
Broker
Queue, Pub/Sub
Kafka
Tuple-at-a-Time Processing
Storm
• Open-source real-time distributed stream processing system
  – Started at BackType
  – BackType acquired by Twitter in 2011
  – Now an Apache project
• Storm aspires to be the Hadoop of real-time processing!
Storm Topologies
• A Storm topology = a “job”
  – Once started, runs continuously until killed
• A Storm topology is a computation graph
  – Graph contains nodes and edges
  – Nodes hold processing logic (i.e., transformations over their input)
  – Directed edges indicate communication between nodes
• Processing semantics:
  – At most once: without acknowledgments
  – At least once: with acknowledgments
Streams, Spouts, and Bolts
(Diagram: spouts emit streams that flow through a graph of bolts.)
• Spouts
– Stream generators
– May propagate a single stream to multiple consumers
• Bolts
– Subscribe to streams
– Stream transformers
– Process incoming streams and produce new ones
• Streams
– The basic collection abstraction: an unbounded sequence of tuples
– Streams are transformed by the processing elements of a topology
Stream Groupings
• Bolts are executed by multiple workers in parallel
• When a bolt emits a tuple, where should it go?
• Stream groupings:
  – Shuffle grouping: round-robin
  – Field grouping: based on data value
(Diagram: tuples from two spouts routed across parallel bolt instances.)
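The two groupings can be expressed as routing functions. This is a plain-Scala sketch of the idea, not Storm's actual API; the names are illustrative.

```scala
// Sketch of the two stream groupings: shuffle = round-robin over tasks,
// field = hash of the grouping field, so equal values always land on the
// same task.
object Groupings {
  // Field grouping: deterministic hash of the field value, mod task count.
  def fieldGrouping(fieldValue: String, numTasks: Int): Int =
    ((fieldValue.hashCode % numTasks) + numTasks) % numTasks

  // Shuffle grouping: the i-th tuple goes to task i mod numTasks.
  def shuffleGrouping(tupleIndex: Long, numTasks: Int): Int =
    (tupleIndex % numTasks).toInt
}
```

Field grouping is what makes stateful bolts (e.g., a per-word counter) correct: every tuple with the same field value reaches the same task, so its partial state is never split.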
From Storm to Heron
• Heron = API-compatible re-implementation of Storm
Source: https://blog.twitter.com/2015/flying-faster-with-twitter-heron
Source: https://blog.twitter.com/2015/flying-faster-with-twitter-heron
Mini-Batch Processing
Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
(Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results)
§ Chop up the live stream into batches of X seconds
§ Spark treats each batch of data as RDDs and processes them using RDD operations
§ Finally, the processed results of the RDD operations are returned in batches
Source: All following Spark Streaming slides by Tathagata Das
Discretized Stream Processing
§ Batch sizes as low as ½ second, latency ~ 1 second
§ Potential for combining batch processing and streaming processing in the same system
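Discretization itself is easy to sketch without Spark: chop the incoming sequence into fixed-size batches and run the same batch job on each. Names here are hypothetical, not Spark's API.

```scala
// Plain-Scala sketch of discretized stream processing: an unbounded-looking
// stream is grouped into batches, and one "batch job" runs per batch.
object MiniBatch {
  def process[T, R](stream: Iterator[T], batchSize: Int)(job: Seq[T] => R): Iterator[R] =
    stream.grouped(batchSize).map(batch => job(batch))
}
```

In the real system the batch boundary is a time interval ("batches of X seconds") rather than a count, but the shape of the computation is the same: the streaming program is just the batch program applied repeatedly.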
Example: Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
DStream: a sequence of RDDs representing a stream of data
batch @ t, batch @ t+1, batch @ t+2
tweets DStream
Twitter Streaming API
stored in memory as an RDD (immutable, distributed)
Example: Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
flatMap flatMap flatMap
…
transformation: modify data in one DStream to create another DStream
new DStream
new RDDs created for every batch
batch @ t, batch @ t+1, batch @ t+2
tweets DStream
hashTags DStream: [#cat, #dog, …]
Example: Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
output operation: to push data to external storage
flatMap flatMap flatMap
save save save
batch @ t, batch @ t+1, batch @ t+2
tweets DStream
hashTags DStream
every batch saved to HDFS
Fault-tolerance
§ RDDs remember the sequence of operations that created them from the original fault-tolerant input data
§ Batches of input data are replicated in the memory of multiple worker nodes, and are therefore fault-tolerant
§ Data lost due to worker failure can be recomputed from the replicated input data
input data replicated in memory
flatMap
lost partitions recomputed on other workers
tweetsRDD
hashTagsRDD
Key concepts
§ DStream – sequence of RDDs representing a stream of data
- Sources: Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
§ Transformations – modify data from one DStream to another
- Standard RDD operations – map, countByValue, reduce, join, …
- Stateful operations – window, countByValueAndWindow, …
§ Output Operations – send data to external entity
- saveAsHadoopFiles – saves to HDFS
- foreach – do anything with each batch of results
Example: Count the hashtags
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.countByValue()
flatMap
map
reduceByKey
flatMap
map
reduceByKey
…
flatMap
map
reduceByKey
batch @ t, batch @ t+1, batch @ t+2
hashTags
tweets
tagCounts: [(#cat, 10), (#dog, 25), …]
Example: Count the hashtags over last 10 mins
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
sliding window operation: window length = Minutes(10), sliding interval = Seconds(1)
tagCounts
Example: Count the hashtags over last 10 mins
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
hashTags
t-1 t t+1 t+2 t+3
sliding window
countByValue: count over all the data in the window
Smart window-based countByValue
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
hashTags
t-1 t t+1 t+2 t+3
countByValue (incremental): add the counts from the new batch in the window, subtract the counts from the batch before the window
tagCounts
Smart window-based reduce
§ Technique to incrementally compute count generalizes to many reduce operations
- Need a function to “inverse reduce” (“subtract” for counting)
§ Could have implemented counting as:
hashTags.reduceByKeyAndWindow(_ + _, _ - _, Minutes(1), …)
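The incremental trick can be shown in plain Scala (not Spark's API): keep a running per-key total, add each arriving batch, and apply the "inverse reduce" (here, subtraction) to the batch that falls out of the window. Names are illustrative.

```scala
// Incremental windowed counting via "inverse reduce".
object WindowedCount {
  // Merge map b into map a with the given sign (+1 to add, -1 to subtract),
  // dropping keys whose count returns to zero.
  private def merge(a: Map[String, Long], b: Map[String, Long], sign: Long): Map[String, Long] =
    b.foldLeft(a) { case (acc, (k, v)) =>
      val nv = acc.getOrElse(k, 0L) + sign * v
      if (nv == 0L) acc - k else acc.updated(k, nv)
    }

  // Windowed counts after each batch: add the incoming batch, subtract the
  // one that just left the window (instead of recounting the whole window).
  def slide(batches: Seq[Map[String, Long]], window: Int): Seq[Map[String, Long]] =
    batches.indices.foldLeft((Map.empty[String, Long], Vector.empty[Map[String, Long]])) {
      case ((total, out), i) =>
        val added = merge(total, batches(i), +1L)
        val next  = if (i >= window) merge(added, batches(i - window), -1L) else added
        (next, out :+ next)
    }._2
}
```

This is why the generalization needs an inverse-reduce function: without one, the count falling out of the window cannot be removed from the running total.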
Integrating Batch and Online Processing
Summingbird
A domain-specific language (in Scala) designed to integrate batch and online MapReduce computations

Idea #1: Algebraic structures provide the basis for seamless integration of batch and online processing
Idea #2: For many tasks, close enough is good enough (probabilistic data structures as monoids)
Batch and Online MapReduce
“map”:
flatMap[T, U](fn: T => List[U]): List[U]
map[T, U](fn: T => U): List[U]
filter[T](fn: T => Boolean): List[T]
“reduce”:
sumByKey
Idea #1: Algebraic structures provide the basis for seamless integration of batch and online processing

Semigroup = ( M , ⊕ )
⊕ : M × M → M, s.t. ∀ m1, m2, m3 ∈ M: (m1 ⊕ m2) ⊕ m3 = m1 ⊕ (m2 ⊕ m3)

Monoid = Semigroup + identity
∃ ε s.t. ε ⊕ m = m ⊕ ε = m, ∀ m ∈ M

Commutative Monoid = Monoid + commutativity
∀ m1, m2 ∈ M: m1 ⊕ m2 = m2 ⊕ m1

Simplest example: integers with + (addition)
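These definitions translate directly into Scala traits, with integer addition as the "simplest example" instance. Names are illustrative, not Algebird's actual type classes.

```scala
// The algebraic hierarchy from the slide, as Scala traits.
trait Semigroup[M] { def plus(a: M, b: M): M }        // associative ⊕
trait Monoid[M] extends Semigroup[M] { def zero: M }  // adds identity ε

// Integers with + form a (commutative) monoid: + is associative and 0 is ε.
object IntAddition extends Monoid[Int] {
  def plus(a: Int, b: Int): Int = a + b
  def zero: Int = 0
}
```

The laws themselves are not enforced by the types; they are the contract an instance must satisfy, which the assertions below spot-check for IntAddition.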
Idea #1: Algebraic structures provide the basis for seamless integration of batch and online processing

Summingbird values must be at least semigroups (most are commutative monoids in practice)

Power of associativity: you can put the parentheses anywhere!

( a ⊕ b ⊕ c ⊕ d ⊕ e ⊕ f )                  Batch = Hadoop
(( a ⊕ b ⊕ c ) ⊕ ( d ⊕ e ⊕ f ))            Mini-batches
((((( a ⊕ b ) ⊕ c ) ⊕ d ) ⊕ e ) ⊕ f )      Online = Storm

Results are exactly the same!
def wordCount[P <: Platform[P]]
    (source: Producer[P, String], store: P#Store[String, Long]) =
  source
    .flatMap { sentence => toWords(sentence).map(_ -> 1L) }
    .sumByKey(store)

Scalding.run {
  wordCount[Scalding](
    Scalding.source[Tweet]("source_data"),
    Scalding.store[String, Long]("count_out")
  )
}

Storm.run {
  wordCount[Storm](
    new TweetSpout(),
    new MemcacheStore[String, Long]
  )
}
Summingbird Word Count
Run on Scalding (Cascading/Hadoop)
Run on Storm
where data comes from / where data goes
“map”
“reduce”
read from HDFS
write to HDFS
read from message queue
write to KV store
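The wordCount pipeline above can be mimicked in plain Scala with no Platform abstraction: flatMap each sentence into (word, 1L) pairs, then "sumByKey" as a group-and-sum. The toWords tokenizer here is a hypothetical stand-in for the one the slides assume.

```scala
// Platform-free sketch of the Summingbird word-count pipeline.
object WordCount {
  // Hypothetical tokenizer: lowercase, split on whitespace.
  def toWords(sentence: String): List[String] =
    sentence.toLowerCase.split("\\s+").filter(_.nonEmpty).toList

  // "map" phase: flatMap to (word, 1L); "reduce" phase: sum counts per key.
  def wordCount(source: Iterator[String]): Map[String, Long] =
    source
      .flatMap(s => toWords(s).map(_ -> 1L))
      .toList
      .groupMapReduce(_._1)(_._2)(_ + _)
}
```

Because Long addition is a commutative monoid, the per-key sums come out the same whether the pairs arrive one at a time (online) or all at once (batch), which is the point of the Platform abstraction.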
(Diagrams: the Hadoop dataflow of Input → Map → Reduce → Output, vs. a Storm topology of a spout feeding bolts that write to memcached.)
“Boring” monoids
addition, multiplication, max, min
moments (mean, variance, etc.)
sets
tuples of monoids
hashmaps with monoid values

More interesting monoids?
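"Hashmaps with monoid values" is worth spelling out, since it is exactly what sumByKey relies on: merging two maps, combining values for shared keys with the value monoid (here, Long addition). A sketch with illustrative names:

```scala
// Maps whose values form a monoid are themselves a monoid:
// zero = the empty map, plus = key-wise merge using the value monoid.
object MapMonoid {
  val zero: Map[String, Long] = Map.empty

  def plus(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + v)
    }
}
```

Swapping the value monoid swaps the semantics of the merge: Long addition gives counting, max gives per-key maxima, and a sketch monoid gives approximate counting, all through the same plus.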
Idea #2: For many tasks, close enough is good enough!

“Interesting” monoids
Bloom filters (set membership)
HyperLogLog counters (cardinality estimation)
Count-min sketches (event counts)

Common features:
1. Variations on hashing
2. Bounded error
Cheat sheet

                     Exact      Approximate
Set membership       set        Bloom filter
Set cardinality      set        HyperLogLog counter
Frequency count      hashmap    count-min sketch
Task: count queries by hour

Exact with hashmaps:

def wordCount[P <: Platform[P]]
    (source: Producer[P, Query], store: P#Store[Long, Map[String, Long]]) =
  source
    .flatMap { query => (query.getHour, Map(query.getQuery -> 1L)) }
    .sumByKey(store)

Approximate with CMS:

def wordCount[P <: Platform[P]]
    (source: Producer[P, Query], store: P#Store[Long, SketchMap[String, Long]])
    (implicit countMonoid: SketchMapMonoid[String, Long]) =
  source
    .flatMap { query => (query.getHour, countMonoid.create((query.getQuery, 1L))) }
    .sumByKey(store)
Hybrid Online/Batch Processing
Example: count historical clicks and clicks in real time

(Diagram: a Summingbird program compiles to both a Storm topology and a Hadoop job. A message queue feeds the Storm topology and is also ingested into HDFS, which the Hadoop job reads. Online results are written to an online key-value store, batch results to a batch key-value store, and a client library queries both to answer client requests.)
Source: Wikipedia (Japanese rock garden)
Questions?