Algorithms for Distributed Functional Monitoring
Ke Yi, HKUST
Joint work with Graham Cormode (AT&T Labs) and S. Muthukrishnan (Google Inc.)
The Story Begins with ...
The Model
[Figure: two item streams arriving over time t, one observed by Alice and one by Bob]
Alice observes A(t) by time t
Bob observes B(t) by time t
A(t), B(t): multisets
Carole tries to compute f(A(t) ∪ B(t)) for all t
All parties have infinite computing power
Goal is to minimize communication
The Model
[Figure: several item streams, one per site]
k sites
Continuous Communication Model / Distributed Streaming Model
Combination of Two Models
[Figure: the communication model combined with the streaming model]
Communication model (the “one-shot” model)
Streaming model
Continuous Communication Model / Distributed Streaming Model
Other Models [Gibbons and Tirthapura, 2001]
[Figure: two item streams over time t]
Carole tries to compute f(A ∪ B) in the end
All parties make one pass using small memory and small communication
Applied Motivation: Distributed Monitoring
Large-scale querying/monitoring: inherently distributed!
  Streams physically distributed across remote sites
  E.g., stream of UDP packets through routers
Challenge is “holistic” querying/monitoring
  Queries over the union of distributed streams Q(S1 ∪ S2 ∪ …)
  Streaming data is spread throughout the network
[Figure: remote sites S1, …, S6 observe bit streams; a query site at the Network Operations Center (NOC) evaluates the query Q(S1 ∪ S2 ∪ …)]
Slide from the tutorial “Streaming in a connected world: Querying and tracking distributed data streams” at VLDB’06 and SIGMOD’07 [Cormode and Garofalakis]
Applied Motivation: Distributed Monitoring
Traditional approach: “pull” based
  Query all nodes once in a while
  Expensive communication, most of it wasted
  Inaccurate
Current trend: moving towards a “push” based approach
  The remote sites alert the coordinator when something interesting happens
Theoretical Questions
Upper bounds: What are the worst-case communication bounds for a given f?
Lower bounds: Is there a gap in the communication complexity between the one-shot model and the continuous model?
The Frequency Moments
Assume integer domain [n] = {1, …, n}
Item i appears m_i times
The p-th frequency moment: F_p = ∑_{i=1}^{n} m_i^p
F1 is the cardinality of A
F0 is # unique items in A (define 0^0 = 0)
F2 is Gini’s index of homogeneity in statistics, and the self-join size in databases
Extensively studied since [Alon, Matias, and Szegedy, 1999]
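As a small illustration of the definition, the following Python sketch computes F_p exactly for a multiset held in memory. The function name `frequency_moment` and the example stream are illustrative assumptions, not part of the talk; the monitoring problem is precisely about avoiding this kind of centralized computation.

```python
from collections import Counter

def frequency_moment(items, p):
    """Return F_p = sum over items i of m_i ** p, where m_i is the count of i."""
    counts = Counter(items)  # m_i for every item that appears at least once
    # Items that never appear contribute 0 (the 0^0 = 0 convention for p = 0),
    # so summing over the observed items is enough.
    return sum(m ** p for m in counts.values())

stream = [1, 4, 2, 1, 3, 4, 5]
f0 = frequency_moment(stream, 0)  # number of distinct items
f1 = frequency_moment(stream, 1)  # total number of items
f2 = frequency_moment(stream, 2)  # self-join size
```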
Approximate Monitoring
Must trigger alarm when Fp > τ
Cannot trigger alarm when Fp < (1 − ε) τ
Why approximate: Exact monitoring is expensive and unnecessary
Why monitoring:
  Most applications only need monitoring
  Tracking can be simulated by monitoring with τ = 1+ε, (1+ε)^2, (1+ε)^3, …, so at most an O(1/ε) factor away
[Figure: Fp plotted against time, with the levels (1 − ε) τ and τ and the alarm marked]
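The reduction from tracking to monitoring uses a geometric sequence of thresholds. A minimal sketch of that sequence is given here, assuming a known upper limit `max_value` on Fp; the function name and that parameter are hypothetical and only serve the illustration.

```python
def tracking_thresholds(eps, max_value):
    """Return (1+eps), (1+eps)^2, ... up to the first threshold >= max_value."""
    thresholds = []
    tau = 1.0 + eps
    while True:
        thresholds.append(tau)
        if tau >= max_value:
            return thresholds
        tau *= 1.0 + eps

# One monitoring instance would be run per threshold; the most recent alarm
# then pins down Fp to within a (1 + eps) factor.
levels = tracking_thresholds(0.5, 1000)
```

The number of thresholds grows like (1/ε) times the logarithm of the largest value, which is where the extra 1/ε factor in the tracking cost comes from.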
Prior Work
Several papers in the database literature
Mostly heuristic based
Bad worst-case bounds, no lower bounds
Previous bounds, followed by the bounds in this talk:
F1: O(k/ε · log(τ/k)) [SIGMOD’06]; this talk: O(k log(1/ε))
F0: Õ(k^2/ε^3) [ICDE’06]; this talk: Õ(k/ε^2)
F2: Õ(k^2/ε^4) [VLDB’05]; this talk: Õ(k^2/ε + k^(3/2)/ε^3)
Õ() suppresses polylog factors
Continuous vs One-Shot
If there is a continuous monitoring algorithm that communicates X bits, then there is a one-shot algorithm that communicates O(X+k) bits
Our Results
Good news: all continuous bounds (except F2) are close to their one-shot counterparts
Bad news: all continuous bounds (except F2) are close to their one-shot counterparts
Talk Outline
Introduction
Deterministic F1 algorithm: O(k log(1/ε))
Randomized F1 algorithm: O(1/ε^2 · log(1/δ))
Randomized F0 algorithm: Õ(k/ε^2)
Randomized F2 algorithm: Õ(k^2/ε + k^(3/2)/ε^3)
Conclusions
Deterministic F1 Algorithm
The first round:
[Figure: sites and coordinator, with per-site threshold τ/2k]
Terminates round after receiving k signals
τ/2k · k = τ/2 < F1 < τ
Deterministic F1 Algorithm
The second round:
[Figure: sites and coordinator, with per-site threshold τ/4k]
Terminates round after receiving k signals
3τ/4 < F1 < τ
Deterministic F1 Algorithm
Each round communicates O(k) bits
Continue until Δ = ετ: O(log(1/ε)) rounds
[Figure: sites and coordinator in the last round, Δ = ετ]
After the last round, we have (1 − ε)τ < F1 < τ
Total communication: O(k log(1/ε))
Lower bound: Ω(k log(1/(εk)))
One-shot: O(k log(1/ε))
Lower bound: Ω(k log(1/(εk)))
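The round structure can be simulated in a few lines of Python. This is a sketch under stated assumptions rather than the algorithm as presented: the coordinator polls every site for its exact count at the end of a round, the per-site threshold is recomputed as (τ − F1)/2k from that count, and ετ ≥ 2k so that thresholds never drop below one item. The function `monitor_f1` and the message accounting are hypothetical.

```python
import random

def monitor_f1(stream, k, tau, eps):
    """Simulate round-based monitoring of F1 (the total item count).

    stream yields site indices in range(k); each entry is one arriving item.
    Returns (f1_at_alarm, messages, rounds), or None if the stream ends
    before the coordinator raises the alarm.
    """
    # Assumption of this sketch: thresholds stay >= 1 in every round.
    assert eps * tau >= 2 * k
    counts = [0] * k       # exact local counts, held by the sites
    base = 0               # exact F1 known to the coordinator at round start
    messages = rounds = 0
    threshold = (tau - base) // (2 * k)   # tau/2k in the first round
    pending = [0] * k      # items at each site since its last signal
    signals = 0
    for site in stream:
        counts[site] += 1
        pending[site] += 1
        if pending[site] == threshold:    # site sends a one-bit signal
            pending[site] = 0
            signals += 1
            messages += 1
        if signals == k:                  # coordinator ends the round
            base = sum(counts)            # poll all sites (assumed step)
            messages += 2 * k             # k requests plus k replies
            rounds += 1
            if tau - base <= eps * tau:   # remaining slack is at most eps*tau
                return base, messages, rounds
            threshold = (tau - base) // (2 * k)
            pending = [0] * k
            signals = 0
    return None

rng = random.Random(0)
k, tau, eps = 4, 10_000, 0.05
stream = [rng.randrange(k) for _ in range(tau)]
f1, messages, rounds = monitor_f1(stream, k, tau, eps)
```

Within a round, fewer than 2k · threshold items can arrive before the k-th signal, so F1 stays below τ, and each round roughly halves the remaining slack τ − F1.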
F0: # Distinct Items
Lower bound: Any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits
Consider the one-shot case first
Use “sketches”: small-space streaming algorithms
“Combine” the sketches from the k sites
FM sketch [Flajolet and Martin 1985; Alon, Matias, and Szegedy, 1999]
FM Sketch
Take a pairwise independent random hash function h : {1,…,n} → {1,…,2^d}, where 2^d > n
For each incoming element x, compute h(x)
  e.g., h(5) = 10101100010000
  Count how many trailing zeros
Remember the maximum number of trailing zeros in any h(x)
Let Y be the maximum number of trailing zeros
Can show E[2^Y] = # distinct elements
FM Sketch
So 2^Y is an unbiased estimator for # distinct elements
However, it has a large variance
Some recent techniques [Gibbons and Tirthapura, 2001; Bar-Yossef, Jayram, Kumar, Sivakumar, and Trevisan, 2002] produce a good estimator that is within relative error ε with probability 1 − δ
Space increased to Õ(1/ε^2)
FM sketch has linearity
  Y1 from A, Y2 from B, then 2^max{Y1, Y2} estimates # distinct items in A ∪ B
A one-shot algorithm with communication Õ(k/ε^2)
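A single-copy FM sketch and its merge rule can be sketched in Python as follows. The class `FMSketch`, the Mersenne prime `P`, and the hash h(x) = ((a·x + b) mod P) truncated to `bits` bits are assumptions chosen for illustration; truncation makes the hash only approximately pairwise independent, and the variance-reduction techniques cited above are not included.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed larger than any hashed item

class FMSketch:
    def __init__(self, seed, bits=32):
        rng = random.Random(seed)   # same seed => same hash function
        self.a = rng.randrange(1, P)
        self.b = rng.randrange(P)
        self.bits = bits
        self.y = 0                  # maximum number of trailing zeros so far

    def _hash(self, x):
        return ((self.a * x + self.b) % P) & ((1 << self.bits) - 1)

    def add(self, x):
        """Insert x; return True if Y increased."""
        h = self._hash(x)
        tz = self.bits if h == 0 else (h & -h).bit_length() - 1
        if tz > self.y:
            self.y = tz
            return True
        return False

    def merge(self, other):
        """Combine with a sketch built using the same hash function."""
        self.y = max(self.y, other.y)

    def estimate(self):
        return 2 ** self.y

# Two sites and, for comparison, one sketch that sees the union.
site_a, site_b, union = FMSketch(seed=7), FMSketch(seed=7), FMSketch(seed=7)
for x in [1, 4, 2, 1, 3, 4, 5]:
    site_a.add(x)
    union.add(x)
for x in [2, 3, 5, 2, 1, 2]:
    site_b.add(x)
    union.add(x)
site_a.merge(site_b)
```

Merging only works when both sites use the same hash function, which is why all three sketches share a seed.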
Continuously Monitoring F0
FM sketch is monotone
  Yi is non-decreasing, and Yi < log n
Whenever Yi increases, notify the coordinator
The coordinator can always have the up-to-date combined FM sketch
Total communication: Õ(k/ε^2)
Lower bound: Ω(k)
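The notify-on-increase rule can be simulated for a single copy of the sketch. This is a sketch under assumptions: the function `monitor_distinct`, the hash construction, and the `bits` cap are hypothetical, and a full algorithm would run many copies to reach relative error ε.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed larger than any hashed item

def trailing_zeros(h, bits):
    return bits if h == 0 else (h & -h).bit_length() - 1

def monitor_distinct(stream, k, seed=1, bits=32):
    """stream yields (site, item) pairs.

    Returns (coordinator_y, messages, global_y), where global_y is the
    value a single sketch over the union of all streams would hold.
    """
    rng = random.Random(seed)
    a, b = rng.randrange(1, P), rng.randrange(P)
    mask = (1 << bits) - 1
    local_y = [0] * k      # Y_i at each site
    coordinator_y = 0      # combined sketch at the coordinator
    global_y = 0
    messages = 0
    for site, item in stream:
        tz = trailing_zeros(((a * item + b) % P) & mask, bits)
        global_y = max(global_y, tz)
        if tz > local_y[site]:          # Y_i increased: notify coordinator
            local_y[site] = tz
            messages += 1
            coordinator_y = max(coordinator_y, tz)
    return coordinator_y, messages, global_y

rng = random.Random(0)
k, bits = 5, 32
stream = [(rng.randrange(k), rng.randrange(1, 1000)) for _ in range(5000)]
coordinator_y, messages, global_y = monitor_distinct(stream, k, bits=bits)
```

Because each Yi can only increase, and at most `bits` times, a site sends at most `bits` messages per copy no matter how long its stream is.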
F2: The One-Shot Case
Lower bound: Any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits
Consider the one-shot case first
Use “sketches”: small-space streaming algorithms
“Combine” the sketches from the k sites
AMS sketch [Alon, Matias, and Szegedy, 1999]
AMS Sketch: “Tug-of-War”
Take a 4-wise independent random hash function h : {1,…,n} → {−1,+1}
Compute Y = ∑ h(x) over all x
Y^2 is an unbiased estimator for F2
Use O(1/ε^2 · log(1/δ)) copies to guarantee a good estimator that is within relative error ε with probability 1 − δ
Linearity still holds!
One-shot case can be solved with communication Õ(k/ε^2)
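The tug-of-war sketch and its linearity can be illustrated in Python. The class `AMSSketch` is a hypothetical sketch: signs come from the parity of a random degree-3 polynomial modulo a prime (approximately 4-wise independent), and the copies are combined by a plain average rather than the median-of-means construction usually used to obtain the 1 − δ guarantee.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed larger than any hashed item

class AMSSketch:
    def __init__(self, seed, copies=64):
        rng = random.Random(seed)   # same seed => same hash functions
        self.coeffs = [[rng.randrange(P) for _ in range(4)]
                       for _ in range(copies)]
        self.y = [0] * copies       # one counter Y per copy

    def _sign(self, c, x):
        # Degree-3 polynomial mod P, reduced to a +1/-1 sign by parity.
        v = (((c[0] * x + c[1]) * x + c[2]) * x + c[3]) % P
        return 1 if v & 1 else -1

    def add(self, x):
        for j, c in enumerate(self.coeffs):
            self.y[j] += self._sign(c, x)

    def merge(self, other):
        """Linearity: counters of two sketches simply add up."""
        self.y = [u + v for u, v in zip(self.y, other.y)]

    def estimate(self):
        return sum(v * v for v in self.y) / len(self.y)

site_a, site_b, union = AMSSketch(seed=11), AMSSketch(seed=11), AMSSketch(seed=11)
for x in [1, 4, 2, 1, 3, 4, 5]:
    site_a.add(x)
    union.add(x)
for x in [2, 3, 5, 2, 1, 2]:
    site_b.add(x)
    union.add(x)
site_a.merge(site_b)

single = AMSSketch(seed=5)
for _ in range(5):
    single.add(7)   # one item with count 5, so F2 = 25
```

Unlike the FM sketch, the counters can move in both directions as items arrive, which is the non-monotonicity that makes continuous monitoring of F2 harder.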
However…
Y is not monotone!
Can’t afford to send all changes of the local sketch to the coordinator
F2 Monitoring: Multi-Round Algorithm
Beginning of a round
[Figure: sites, each with a sketch of size Õ(1/ε^2), and a coordinator holding an estimate for F2]
F2 Monitoring: Multi-Round Algorithm
During a round
[Figure: sites and a coordinator holding an estimate for F2]
Each site sends a signal whenever the F2 of its updates increases by t = (τ − F2)^2 / (64 k^2 τ)
F2 Monitoring: Multi-Round Algorithm
End of a round: when k signals are received
[Figure: sites and a coordinator holding an estimate for F2]
old F2 + (τ − old F2) · ε/k < new F2 < τ
# rounds: O(k/ε)
Total cost: Õ(k^2/ε^3)
F2: Round / Sub-Round Algorithm
End of a sub-round: when k signals are received
[Figure: sites hold “rough” sketches of size Õ(1); the coordinator, holding an estimate for F2, combines the sketches and maintains an upper bound of F2]
old F2 + (τ − old F2) · ε/k < new F2 < τ
Total cost: Õ(k^2/ε + k^(3/2)/ε^3)
One-shot: Õ(k/ε^2)
Lower bound: Ω(k)
Open Problems
Still no clear separation between the one-shot model and the continuous model
  F2 is an interesting case
Many other functions f
  Statistics: entropy, heavy hitters
  Geometric measures: diameter, width, …
Variations of the model
  One-way vs two-way communication
  Does having a broadcast channel help?
  Sliding windows?
“Continuous Communication Complexity”?
Thank you!