Leveraging Big Data
Instructors: Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo
http://www.cohenwang.com/edith/bigdataclass2013
Disclaimer: This is the first time we are offering this class (new material, also to the instructors!)
• Expect many glitches
• Ask questions
What is Big Data? Huge amounts of information, collected continuously: network activity, search requests, logs, location data, tweets, commerce, a data footprint for each person, …
What's new?
Scale: terabytes → petabytes → exabytes → …
Diversity: relational, logs, text, media, measurements
Movement: streaming data, volumes moved around
Eric Schmidt (Google), 2010: "Every 2 Days We Create As Much Information As We Did Up to 2003"
The Big Data Challenge
To be able to handle and leverage this information, to offer better services, we need
Architectures and tools for data storage, movement, processing, mining, …
Good models
Big Data Implications
• Many classic tools are not all that relevant
  – Can't just throw everything into a DBMS
• Computational models:
  – map-reduce (distributing/parallelizing computation)
  – data streams (one or few sequential passes)
• Algorithms:
  – Can't go much beyond "linear" processing
  – Often need to trade off accuracy and computation cost
• More issues:
  – Understand the data: behavior models with links to Sociology, Economics, Game Theory, …
  – Privacy, Ethics
This Course
Selected topics that
• We feel are important
• We think we can teach
• Aiming for breadth
  – but also for depth and a good working understanding of the concepts
http://www.cohenwang.com/edith/bigdataclass2013
Today
• Short intro to synopsis structures
• The data streams model
• The Misra Gries frequent elements summary
  – Stream algorithm (adding an element)
  – Merging Misra Gries summaries
• Quick review of randomization
• Morris counting algorithm
  – Stream counting
  – Merging Morris counters
• Approximate distinct counting
Synopsis (Summary) Structures
A small summary of a large data set that (approximately) captures some statistics/properties we are interested in.
Examples: random samples, sketches/projections, histograms, …
[Diagram: Data D → Synopsis S]
Query a synopsis: Estimators
A function f̂ we apply to a synopsis S in order to obtain an estimate f̂(S) of a property/statistic/function f(D) of the data D
[Diagram: Data D → Synopsis S; the estimate f̂(S) approximates f(D).]
Synopsis Structures
Useful features:
Easy to add an element
Mergeable: can create a summary of the union of data sets from their individual summaries
Deletions/"undo" support
Flexible: supports multiple types of queries
Mergeability
[Diagram: Data 1 → Synopsis 1; Data 2 → Synopsis 2; Data 1 + 2 → Synopsis 12]
Enough to consider merging two sketches
Why mergeability is useful
[Diagram: a merge tree — Synopsis 1 and Synopsis 2 merge into S. 1∪2; S. 1∪2 and Synopsis 5 merge into S. 1∪2∪5; Synopsis 3 and Synopsis 4 merge into S. 3∪4; finally the two merge into a synopsis of 1∪2∪3∪4∪5.]
Synopsis Structures: Why?
Data can be too large to:
Keep for long or even short term
Transmit across the network
Process queries over in reasonable time/computation
"Data, data, everywhere." The Economist, 2010
The Data Stream Model
Data is read sequentially in one (or few) passes
We are limited in the size of working memory.
We want to create and maintain a synopsis which allows us to obtain good estimates of properties
Streaming Applications
Network management: traffic going through high-speed routers (data cannot be revisited)
I/O efficiency (sequential access is cheaper than random access)
Scientific data, satellite feeds
Streaming model
Sequence of elements from some domain
<x1, x2, x3, x4, ..... >
Bounded storage:
working memory << stream size
usually O(log n) or O(n^α) for α < 1
Fast processing time per stream element
What can we compute over a stream ?
Some functions are easy: min, max, sum, …
We use a single register s, with simple updates:
• Maximum: Initialize s ← 0
  For element x, s ← max(s, x)
• Sum: Initialize s ← 0
  For element x, s ← s + x
32, 112, 14, 9, 37, 83, 115, 2,
The "synopsis" here is a single value. It is also mergeable.
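As a concrete illustration (our sketch, not from the slides), both single-register synopses fit a tiny add/merge interface in Python:

    import math

    class SumSynopsis:
        """Single-register synopsis for the stream sum."""
        def __init__(self):
            self.s = 0
        def add(self, x):           # process one stream element
            self.s += x
        def merge(self, other):     # sum of a union = sum of the sums
            self.s += other.s

    class MaxSynopsis:
        """Single-register synopsis for the stream maximum."""
        def __init__(self):
            self.s = -math.inf      # the slide assumes nonnegative elements and starts at 0
        def add(self, x):
            self.s = max(self.s, x)
        def merge(self, other):     # max of a union = max of the maxima
            self.s = max(self.s, other.s)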
Frequent Elements
Elements occur multiple times; we want to find the elements that occur very often.
The number of distinct elements is m
The stream size is n
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4,
Frequent Elements
Applications:
Networking: Find βelephantβ flows
Search: Find the most frequent queries
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4,
Zipf law: typical frequency distributions are highly skewed, with few very frequent elements. Say, the top 10% of elements have 90% of the total occurrences. We are interested in finding the heaviest elements.
Frequent Elements: Exact Solution
Exact solution:
Create a counter for each distinct element on its first occurrence
When processing an element, increment its counter
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4,
[Counters: 32 → 3, 12 → 3, 14 → 1, 7 → 2, 6 → 1, 4 → 1]
Problem: We need to maintain m counters, but can only maintain k ≪ m counters
Frequent Elements: Misra Gries 1982
Processing an element x:
If we already have a counter for x, increment it
Else, if there is no counter but there are fewer than k counters, create a counter for x initialized to 1
Else, decrease all counters by 1. Remove counters that reach 0.
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4,
[Worked example on the stream above with k = 3 counters; m = 6 distinct elements, stream size n = 11.]
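A minimal Python sketch of the update rule (ours; names are illustrative). It also tracks the stream size n so the error bound from the analysis below can be computed:

    class MisraGries:
        def __init__(self, k):
            self.k = k
            self.counters = {}   # element -> counter
            self.n = 0           # stream size, kept for the error bound

        def add(self, x):
            self.n += 1
            if x in self.counters:
                self.counters[x] += 1
            elif len(self.counters) < self.k:
                self.counters[x] = 1
            else:
                # decrement all k counters; remove the ones that hit 0
                for y in list(self.counters):
                    self.counters[y] -= 1
                    if self.counters[y] == 0:
                        del self.counters[y]

        def query(self, x):
            # counter value if present, else 0 (never an over-estimate)
            return self.counters.get(x, 0)

        def error_bound(self):
            # estimates are below the true counts by at most (n - n')/(k + 1)
            return (self.n - sum(self.counters.values())) / (self.k + 1)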
Frequent Elements: Misra Gries 1982
Query: How many times did x occur? If we have a counter for x, return its value. Else, return 0.
This is clearly an under-estimate. What can we say precisely?
Misra Gries 1982: Analysis
How many decrements to a particular counter can we have?
⟺ How many decrement steps can we have?
Suppose the total weight of the structure (sum of counters) is n′
The total weight of the stream (number of occurrences) is n
Each decrement step removes k counts from the structure and does not count the current occurrence of the input element. That is, k + 1 "uncounted" occurrences per decrement step.
⇒ There can be at most (n − n′)/(k + 1) decrement steps
⇒ The estimate is smaller than the true count by at most (n − n′)/(k + 1)
Misra Gries 1982: Analysis
The estimate is smaller than the true count by at most (n − n′)/(k + 1)
⇒ We get good estimates for x when its number of occurrences ≫ (n − n′)/(k + 1)
The error bound is inversely proportional to k
The error bound can be computed with the summary: we can track n (a simple count), and we know k and n′ (computable from the structure).
MG works because typical frequency distributions have few very popular elements ("Zipf law")
Merging two Misra Gries Summaries [ACHPWY 2012]
Basic merge:
If an element x is in both structures, keep one counter with the sum of the two counts
If an element x is in only one structure, keep the counter
Reduce: if there are more than k counters:
Take the (k + 1)-th largest counter
Subtract its value from all other counters
Delete non-positive counters
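In code, the basic merge plus reduce might look like this (a sketch, reusing the MisraGries class from the earlier snippet):

    def merge_mg(a, b):
        k = a.k
        # basic merge: one counter per element, summing counts
        merged = dict(a.counters)
        for x, c in b.counters.items():
            merged[x] = merged.get(x, 0) + c
        # reduce: subtract the (k+1)-th largest counter value from all
        # counters and delete the non-positive ones
        if len(merged) > k:
            threshold = sorted(merged.values(), reverse=True)[k]
            merged = {x: c - threshold
                      for x, c in merged.items() if c > threshold}
        out = MisraGries(k)
        out.counters = merged
        out.n = a.n + b.n
        return out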
Merging two Misra Gries Summaries: Example
[Two summaries: one with counters for 7, 6, 14, the other with counters for 32, 12, 14.]
Basic Merge: sum the counters per element, leaving five counters (for 32, 12, 14, 7, 6).
Reduce, since there are more than k = 3 counters:
Take the (k + 1)-th = 4th largest counter
Subtract its value (2) from all other counters
Delete non-positive counters
Merging MG Summaries: Correctness
Claim: The final summary has at most k counters
Proof: We subtract the (k + 1)-th largest counter from everything, so at most the k largest can remain positive.
Claim: For each element, the final summary count is smaller than the true count by at most (n − n′)/(k + 1), where n = n1 + n2 is the combined stream size and n′ is the final sum of counters.
Merging MG Summaries: Correctness
Claim: For each element, the final summary count is smaller than the true count by at most (n − n′)/(k + 1)
Part 1: Total occurrences: n1. In structure: n1′. Count loss: ≤ (n1 − n1′)/(k + 1)
Part 2: Total occurrences: n2. In structure: n2′. Count loss: ≤ (n2 − n2′)/(k + 1)
Proof: "Counts" for element x can be lost in part 1, in part 2, or in the reduce component of the merge. We add up the bounds on the losses.
The reduce loss is at most C = the value of the (k + 1)-th largest counter
Merging MG Summaries: Correctness
Part 1: Total occurrences: n1. In structure: n1′. Count loss: ≤ (n1 − n1′)/(k + 1)
Part 2: Total occurrences: n2. In structure: n2′. Count loss: ≤ (n2 − n2′)/(k + 1)
⇒ The "count loss" of one element is at most (n1 − n1′)/(k + 1) + (n2 − n2′)/(k + 1) + C
The reduce loss is at most C = the value of the (k + 1)-th largest counter
Merging MG Summaries: Correctness
⇒ at most (n − n′)/(k + 1) uncounted occurrences
Counted occurrences in the structure: after the basic merge and before the reduce: n1′ + n2′. After the reduce: n′
Claim: n1′ + n2′ − n′ ≥ C(k + 1)
Proof: C is erased in the reduce step from each of the k + 1 largest counters; maybe more is erased from smaller counters.
⇒ The "count loss" of one element is at most
(n1 − n1′)/(k + 1) + (n2 − n2′)/(k + 1) + C ≤ (n1 + n2 − n′)/(k + 1) = (n − n′)/(k + 1)
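Spelling out the algebra of the last step in LaTeX (our reconstruction; it uses the claim above in the form C ≤ (n1′ + n2′ − n′)/(k + 1)):

    \frac{n_1 - n_1'}{k+1} + \frac{n_2 - n_2'}{k+1} + C
      \le \frac{n_1 - n_1'}{k+1} + \frac{n_2 - n_2'}{k+1} + \frac{n_1' + n_2' - n'}{k+1}
      = \frac{n_1 + n_2 - n'}{k+1} = \frac{n - n'}{k+1}.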
Using Randomization
• Misra Gries is a deterministic structure
• The outcome is determined uniquely by the input
• Usually we can do much better with randomization
Randomization in Data Analysis
Often a critical tool in getting good results
Random sampling / random projections as a means to reduce size/dimension
Sometimes data is treated as samples from some distribution, and we want to use the data to approximate that distribution (for prediction)
Sometimes introduced into the data to mask insignificant points (for robustness)
Randomization: Quick review
Random variable (discrete or continuous) X
Probability Density Function (PDF)
f_X(x): the probability/density of X = x
Properties: f_X(x) ≥ 0, ∫_{−∞}^{∞} f_X(x) dx = 1
Cumulative Distribution Function (CDF)
F_X(x) = ∫_{−∞}^{x} f_X(t) dt: the probability that X ≤ x
Properties: F_X(x) is monotone non-decreasing from 0 to 1
Quick review: Expectation
Expectation: the "average" value of X:
μ = E[X] = ∫_{−∞}^{∞} x f_X(x) dx
Linearity of expectation: E[aX + b] = aE[X] + b
For random variables X1, X2, X3, …, Xk:
E[∑_{i=1}^{k} Xi] = ∑_{i=1}^{k} E[Xi]
Quick review: Variance
Variance: Var[X] = σ² = E[(X − μ)²] = ∫_{−∞}^{∞} (x − μ)² f_X(x) dx
Useful relations: σ² = E[X²] − μ²
Var[aX + b] = a² Var[X]
The standard deviation is σ = √Var[X]
Coefficient of Variation: CV = σ/μ
Quick review: Covariance
Covariance (a measure of dependence between two random variables X, Y):
Cov[X, Y] = σ_{X,Y} = E[(X − μ_X)(Y − μ_Y)] = E[XY] − μ_X μ_Y
X, Y independent ⟹ Cov[X, Y] = 0
Variance of the sum of X1, X2, …, Xk:
Var[∑_{i=1}^{k} Xi] = ∑_{i,j=1}^{k} Cov[Xi, Xj] = ∑_{i=1}^{k} Var[Xi] + ∑_{i≠j} Cov[Xi, Xj]
When (pairwise) independent: Var[∑_{i=1}^{k} Xi] = ∑_{i=1}^{k} Var[Xi]
Back to Estimators
A function f̂ we apply to "observed data" (or to a "synopsis") S in order to obtain an estimate f̂(S) of a property/statistic/function f(D) of the data D
[Diagram: Data D → Synopsis S; the estimate f̂(S) approximates f(D).]
Quick Review: Estimators
Error: err(f̂) = f̂(S) − f(D)
Bias: Bias[f̂] = E[err(f̂)] = E[f̂] − f(D). When Bias = 0, the estimator is unbiased.
Mean Square Error (MSE): E[err(f̂)²] = Var[f̂] + Bias[f̂]²
Root Mean Square Error (RMSE): √MSE
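The MSE identity follows from a standard expansion (not spelled out on the slide); writing f for f(D):

    E[\mathrm{err}^2] = E[(\hat f - f)^2]
      = E\big[(\hat f - E[\hat f])^2\big]
        + 2\,(E[\hat f] - f)\,E\big[\hat f - E[\hat f]\big]
        + (E[\hat f] - f)^2
      = \mathrm{Var}[\hat f] + \mathrm{Bias}[\hat f]^2,

since E[\hat f - E[\hat f]] = 0.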
Back to stream counting
• Count: Initialize s ← 0
  For each element, s ← s + 1
What if we are happy with an approximate count ?
1, 1, 1, 1, 1, 1, 1, 1,
The register (our synopsis) size is ⌈log₂ n⌉ bits, where n is the current count
Can we use fewer bits? This matters when we have many streams to count and fast memory is scarce (say, inside a backbone router)
Morris Algorithm 1978
Idea: track log n instead of n
Use log log n bits instead of log n bits
The first streaming algorithm
Stream counting:
Stream of +1 increments
Maintain an approximate count
1, 1, 1, 1, 1, 1, 1, 1,
Morris Algorithm
Maintain a "log" counter x
Increment: increment x with probability 2^(−x)
Query: output 2^x − 1
Stream:               1,   1,   1,   1,   1,   1,   1,   1
Count n:              1,   2,   3,   4,   5,   6,   7,   8
p = 2^(−x):           1,  1/2, 1/2, 1/4, 1/4, 1/4, 1/4, 1/8
Counter x (one run):  1,   1,   2,   2,   2,   2,   3,   3   (initially 0)
Estimate 2^x − 1:     1,   1,   3,   3,   3,   3,   7,   7   (initially 0)
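A small Python sketch of the counter (ours). The base parameter defaults to 2 as on the slides; other bases anticipate the variance-reduction variant discussed later:

    import random

    class MorrisCounter:
        def __init__(self, base=2.0):
            self.base = base
            self.x = 0                       # the "log" counter

        def add(self):
            # increment x with probability base^(-x)
            if random.random() < self.base ** (-self.x):
                self.x += 1

        def estimate(self):
            # unbiased: E[base^x] = 1 + n*(base - 1), so invert;
            # for base 2 this is exactly 2^x - 1
            return (self.base ** self.x - 1) / (self.base - 1)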
Morris Algorithm: Unbiasedness
When n = 1: x = 1, and the estimate is n̂ = 2^1 − 1 = 1
When n = 2:
with p = 1/2: x = 1, n̂ = 1
with p = 1/2: x = 2, n̂ = 2^2 − 1 = 3
Expectation: E[n̂] = (1/2)·1 + (1/2)·3 = 2
n = 3, 4, 5, … by induction….
Morris Algorithm: …Unbiasedness
X_n is the random variable corresponding to the counter x when the count is n
We need to show that E[n̂] = E[2^(X_n) − 1] = n
That is, to show that E[2^(X_n)] = n + 1
E[2^(X_n)] = ∑_{j≥0} Pr[X_{n−1} = j] · E[2^(X_n) | X_{n−1} = j]
• We next compute E[2^(X_n) | X_{n−1} = j]
Morris Algorithm: …Unbiasedness
Computing E[2^(X_n) | X_{n−1} = j]:
• with probability 1 − 2^(−j): X_n = j, so 2^(X_n) = 2^j
• with probability 2^(−j): X_n = j + 1, so 2^(X_n) = 2^(j+1)
E[2^(X_n) | X_{n−1} = j] = (1 − 2^(−j))·2^j + 2^(−j)·2^(j+1) = 2^j − 1 + 2 = 2^j + 1
Morris Algorithm: …Unbiasedness
E[2^(X_n) | X_{n−1} = j] = 2^j + 1
E[2^(X_n)] = ∑_{j≥0} Pr[X_{n−1} = j] · E[2^(X_n) | X_{n−1} = j]
= ∑_{j≥0} Pr[X_{n−1} = j] · (2^j + 1)
= ∑_{j≥0} Pr[X_{n−1} = j] · 2^j + ∑_{j≥0} Pr[X_{n−1} = j]
= E[2^(X_{n−1})] + 1 = n + 1   (by the induction hypothesis, E[2^(X_{n−1})] = n)
Morris Algorithm: Variance
How good is the estimate?
• The r.v.'s n̂ = 2^(X_n) − 1 and n̂ + 1 = 2^(X_n) have the same variance: Var[n̂] = Var[n̂ + 1]
• Var[n̂ + 1] = E[2^(2X_n)] − (n + 1)²
• We can show E[2^(2X_n)] = (3/2)n² + (3/2)n + 1
• This means Var[n̂] ≈ n²/2 and CV = σ/n ≈ 1/√2
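Filling in the arithmetic behind the last bullet (our derivation, using the stated value of E[2^(2X_n)]):

    \mathrm{Var}[\hat n] = E[2^{2X_n}] - (n+1)^2
      = \tfrac{3}{2}n^2 + \tfrac{3}{2}n + 1 - (n^2 + 2n + 1)
      = \frac{n^2 - n}{2} \approx \frac{n^2}{2},
    \qquad \mathrm{CV} = \frac{\sigma}{n} \approx \frac{1}{\sqrt{2}}.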
How to reduce the error?
Morris Algorithm: Reducing variance 1
Var[n̂] = σ² ≈ n²/2 and CV = σ/n ≈ 1/√2
Dedicated Method: Base change
IDEA: Instead of counting log₂ n, count log_b n
Increment the counter with probability b^(−x)
When b is closer to 1, we increase accuracy but also increase the counter size.
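With the MorrisCounter sketch from earlier, the base change is just a parameter (illustrative values, not from the slides):

    c = MorrisCounter(base=1.1)    # base close to 1: better accuracy,
    for _ in range(10000):         # but x grows larger (more bits)
        c.add()
    print(c.estimate())            # estimate of 10000 via (b^x - 1)/(b - 1)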
Morris Algorithm: Reducing variance 2
Var[n̂] = σ² ≈ n²/2 and CV = σ/n ≈ 1/√2
Generic Method:
Use k independent counters x1, x2, …, xk
Compute the estimates n̂_i = 2^(x_i) − 1
Average the estimates: n̂′ = (1/k) ∑_{i=1}^{k} n̂_i
Reducing variance by averaging
k (pairwise) independent estimates n̂_i with expectation n and variance σ².
The average estimator: n̂′ = (1/k) ∑_{i=1}^{k} n̂_i
Expectation: E[n̂′] = (1/k) ∑_{i=1}^{k} E[n̂_i] = (1/k)·k·n = n
Variance: Var[n̂′] = (1/k²) ∑_{i=1}^{k} Var[n̂_i] = (1/k²)·k·σ² = σ²/k
CV: σ/n decreases by a factor of √k
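As code, the generic method just runs k independent counters and averages (a sketch building on the MorrisCounter class above):

    class AveragedMorris:
        def __init__(self, k, base=2.0):
            self.counters = [MorrisCounter(base) for _ in range(k)]

        def add(self):
            for c in self.counters:          # independent coin flips
                c.add()

        def estimate(self):
            # averaging k independent estimates divides the variance by k,
            # so the CV drops by a factor of sqrt(k)
            ests = [c.estimate() for c in self.counters]
            return sum(ests) / len(ests)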
Merging Morris Counters
We have two Morris counters x, y for streams X, Y of sizes n_X, n_Y
We would like to merge them: obtain a single counter which has the same distribution (is a Morris counter) for a stream of size n_X + n_Y
Merging Morris Counters
Merge the Morris counts x, y (into x): For i = 1 … y, increment x with probability 2^(−x+i−1)
Morris-count stream X to get x
Morris-count stream Y to get y
Correctness for x = 0: at every step i we have x = i − 1 and the increment probability is 1; in the end, x = y.
Correctness (Idea): We will show that the final value of x "corresponds" to Morris-counting Y after X
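The merge loop is short in code (a sketch over the base-2 MorrisCounter class; by the invariant x ≥ i − 1 established below, the probability never exceeds 1):

    import random

    def merge_morris(cx, cy):
        # fold counter cy into cx: for i = 1..y, increment x w.p. 2^(-x+i-1)
        for i in range(1, cy.x + 1):
            if random.random() < 2.0 ** (i - 1 - cx.x):
                cx.x += 1
        return cx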
Merging Morris Counters: Correctness
We want to achieve the same effect as if Morris counting was applied to the concatenation of the streams X, Y
We consider two scenarios:
1. Morris counting applied to Y
2. Morris counting applied to Y after X
We want to simulate the result of (2), given y (the result of (1)) and x
Merging Morris Counters: Correctness
Restated Morris (for the sake of analysis only):
Associate an (independent) random u(z) ∼ U[0,1] with each element z of the stream
Process element z: increment the counter if u(z) < 2^(−x)
We "map" executions of (1) and (2) by looking at the same randomization u.
We will see that each execution of (1), in terms of the set of elements that increment the counter, maps to many executions of (2)
Merging algorithm: Correctness Plan
We fix the whole run (and randomization) on X.
We fix the set of elements that result in counter increments on Y in (1).
We work with the distribution of u conditioned on the above.
We show that the corresponding distribution over executions of (2) (the set of elements that increment the counter) emulates our merging algorithm.
What is the conditional distribution?
• Elements that did not increment the counter when the counter value was x have u(z) ≥ 2^(−x)
• Elements that did increment the counter have u(z) ≤ 2^(−x)
Stream:        1,     1,     1,      1,      1,      1,      1,      1
p = 2^(−x):    1,    1/2,   1/2,    1/4,    1/4,    1/4,    1/4,    1/8
u(z) ranges: [0,1] [1/2,1] [0,1/2] [1/4,1] [1/4,1] [1/4,1] [0,1/4] [1/8,1]
(Same run as the earlier example: exactly the elements with u(z) below the current 2^(−x) incremented the counter.)
To show correctness of the merge, it suffices to show:
Elements of Y that did not increment in (1) do not increment in (any corresponding run of) (2)
An element z that had the i-th increment in (1), conditioned on x in the simulation so far, increments in (2) with probability 2^(−x+i−1)
We show this inductively.
We also show that at any point x ≥ y′, where y′ is the count in (1).
The first element of Y that incremented the counter in (1) has u(z) ∈ [0,1]. The probability that it gets counted in (2) is
Pr[u(z) ≤ 2^(−x) | u(z) ∈ [0,1]] = 2^(−x)
Initially, x ≥ y′ = 0. After processing, y′ = 1. If x was initially 0, it is incremented with probability 1, so we maintain x ≥ y′.
Proof: An element z of Y that did not increment the counter when its value in (1) was y′ has u(z) ∈ [2^(−y′), 1].
Since we have x ≥ y′, this element will also not increment in (2), since u(z) ≥ 2^(−y′) ≥ 2^(−x).
The counter changes in neither (1) nor (2) after processing z, so we maintain the relation x ≥ y′.
Elements of Y that did not increment in (1) do not increment in (any corresponding run of) (2) ✓
An element z that had the i-th increment in (1), conditioned on x in the simulation so far, increments in (2) with probability 2^(−x+i−1)
Proof: Element z has u(z) ∈ [0, 2^(−(i−1))] (we had y′ = i − 1 before the increment). Element z increments in (2) ⟺ u(z) ∈ [0, 2^(−x)].
Pr[u(z) ∈ [0, 2^(−x)] | u(z) ∈ [0, 2^(−(i−1))]] = 2^(−x+i−1)
• If we had equality x = y′ = i − 1, x is incremented with probability 1, so we maintain the relation x ≥ y′
Random Hash Functions
For a domain D and a probability distribution F over a range V:
A distribution over a family H of hash functions h: D → V with the following properties:
Each function h ∈ H has a concise representation, and it is easy to choose h ∼ H
For each x ∈ D, when choosing h ∼ H, h(x) ∼ F (h(x) is a random variable with distribution F)
The random variables h(x) are independent for different x ∈ D.
We use random hash functions as a way to attach a "permanent" random value to each identifier in an execution
(Simplified and idealized.)
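One simple way to approximate such an (idealized) family in practice is to salt a cryptographic hash with a random seed (our sketch; u(z) ∼ U[0,1] is approximated by 64 output bits):

    import hashlib, random

    def make_hash(seed=None):
        # choosing h ~ H amounts to choosing a random seed
        seed = random.getrandbits(64) if seed is None else seed
        def h(x):
            digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
            # interpret the first 8 bytes as a uniform value in [0, 1)
            return int.from_bytes(digest[:8], "big") / 2.0**64
        return h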
Counting Distinct Elements
Elements occur multiple times; we want to count the number of distinct elements.
The number of distinct elements is n (= 6 in the example)
The number of elements in this example is 11
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4,
Counting Distinct Elements: Example Applications
Networking:
Packet or request streams: Count the number of distinct source IP addresses
Packet streams: Count the number of distinct IP flows (source+destination IP, port, protocol)
Search: Find how many distinct search queries were issued to a search engine each day
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4,
Distinct Elements: Exact Solution
Exact solution:
Maintain an array / associative array / hash table
Hash/place each element into the table
Query: count the number of entries in the table
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4,
Problem: For n distinct elements, the size of the table is Ω(n)
But this is the best we can do (information-theoretically) if we want an exact distinct count.
Distinct Elements: Approximate Counting
IDEA: The size-estimation / min-hash technique [Flajolet-Martin 85, C 94]:
Use a random hash function h(x) ∼ U[0,1], mapping element IDs to uniform random numbers in [0,1]
Track the minimum h(x)
Intuition: the minimum and n are closely related. With n distinct elements, the expectation of the minimum is
E[min h(x)] = 1/(n + 1)
We can use the average estimator with k repetitions
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4,
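A sketch of the resulting estimator with k repetitions (ours; it reuses make_hash from the hash-function snippet, and inverting the average of the minima is just one natural way to turn E[min h(x)] = 1/(n + 1) into an estimate):

    class MinHashDistinct:
        def __init__(self, k):
            self.hs = [make_hash() for _ in range(k)]
            self.mins = [1.0] * k

        def add(self, x):
            # duplicates hash to the same value, so each minimum depends
            # only on the set of distinct elements seen so far
            for i, h in enumerate(self.hs):
                v = h(x)
                if v < self.mins[i]:
                    self.mins[i] = v

        def estimate(self):
            # E[min] = 1/(n+1)  =>  invert the average of the k minima
            avg_min = sum(self.mins) / len(self.mins)
            return 1.0 / avg_min - 1.0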
Bibliography
Misra Gries summaries:
J. Misra and D. Gries. Finding Repeated Elements. Science of Computer Programming 2, 1982.
http://www.cs.utexas.edu/users/misra/scannedPdf.dir/FindRepeatedElements.pdf
Merging: Agarwal, Cormode, Huang, Phillips, Wei, and Yi. Mergeable Summaries. PODS 2012.
Approximate counting (Morris Algorithm):
R. Morris. Counting Large Numbers of Events in Small Registers. Commun. ACM 21(10): 840–842, 1978.
http://www.inf.ed.ac.uk/teaching/courses/exc/reading/morris.pdf
P. Flajolet. Approximate counting: A detailed analysis. BIT 25, 1985.
http://algo.inria.fr/flajolet/Publications/Flajolet85c.pdf
Merging Morris counters: these slides.
Approximate distinct counting:
P. Flajolet and G. N. Martin. Probabilistic counting. FOCS 1983, pages 76–82.
E. Cohen. Size-estimation framework with applications to transitive closure and reachability. JCSS 1997.