Algorithms for Distributed Functional Monitoring

Ke Yi, HKUST

Joint work with Graham Cormode (AT&T Labs) and S. Muthukrishnan (Google Inc.)

The Story Begins with ...

The Model

[Figure: two item streams, 1421345 and 235212, arriving over time t]

Alice observes A(t) by time t

Bob observes B(t) by time t

A(t), B(t): multisets

Carole tries to compute f(A(t) ∪ B(t)) for all t

All parties have infinite computing power

Goal is to minimize communication

The Model

[Figure: several sites, each observing its own stream of items over time]

k sites

Continuous Communication Model / Distributed Streaming Model

Combination of Two Models

[Figure: inputs held by the parties in the communication model, and an item stream in the streaming model]

Communication model (“One-shot Model”)

Streaming model

Combined: Continuous Communication Model / Distributed Streaming Model

Other Models [Gibbons and Tirthapura, 2001]

[Figure: two item streams, 1421345 and 235212, arriving over time t]

Carole tries to compute f(A ∪ B) in the end

All parties make one pass using small memory and small communication

Applied Motivation: Distributed Monitoring

Large-scale querying/monitoring: inherently distributed!

Streams physically distributed across remote sites, e.g., stream of UDP packets through routers

Challenge is “holistic” querying/monitoring

Queries over the union of distributed streams Q(S1 ∪ S2 ∪ …)

Streaming data is spread throughout the network

[Figure: a Network Operations Center (NOC) acting as the query site evaluates the query Q(S1 ∪ S2 ∪ …) over bit streams S1, …, S6 observed at remote sites]

Slide from the tutorial “Streaming in a connected world: Querying and tracking distributed data streams” at VLDB’06 and SIGMOD’07 [Cormode and Garofalakis]

Applied Motivation: Distributed Monitoring

Traditional approach: “pull” based

Query all nodes once in a while

Expensive communication, most of which is wasted

Inaccurate

Current trend: moving towards a “push” based approach

The remote sites alert the coordinator when something interesting happens


Theoretical Questions

Upper bounds: Worst-case communication bounds for a given f ?

Lower bounds: Is there a gap in the communication complexity between the one-shot model and the continuous model?

The Frequency Moments

Assume integer domain [n] = {1, …, n}

Item i appears m_i times in the multiset A

The p-th frequency moment: F_p = Σ_{i=1..n} (m_i)^p

F1 is the cardinality of A

F0 is the number of unique items in A (define 0^0 = 0)

F2 is Gini’s index of homogeneity in statistics, and the self-join size in databases

Extensively studied since [Alon, Matias, and Szegedy, 1999]
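As a concrete illustration of the definition, the following minimal sketch computes F0, F1 and F2 exactly for a small multiset. The function name `frequency_moment` and the example stream are chosen here for illustration only; exact computation like this is the baseline that the monitoring algorithms approximate with far less communication.

```python
from collections import Counter


def frequency_moment(items, p):
    """Return F_p = sum of m_i**p over the items i that occur in `items`."""
    counts = Counter(items)
    # Only items that actually occur appear in `counts`, so every m_i >= 1
    # and the convention 0**0 = 0 is respected automatically.
    return sum(m ** p for m in counts.values())


stream = [1, 4, 2, 1, 3, 4, 5]      # items 1 and 4 appear twice
f0 = frequency_moment(stream, 0)    # number of distinct items
f1 = frequency_moment(stream, 1)    # total number of items
f2 = frequency_moment(stream, 2)    # self-join size
print(f0, f1, f2)
```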

Approximate Monitoring

Must trigger alarm when Fp > τ

Must not trigger alarm when Fp < (1 − ε) τ

Why approximate: Exact monitoring is expensive and unnecessary

Why monitoring: most applications only need monitoring

Tracking can be simulated by monitoring with τ = 1+ε, (1+ε)^2, (1+ε)^3, …, so at most an O(1/ε) factor away.

[Figure: Fp plotted against time, with the levels (1 − ε) τ and τ marked and the point at which the alarm is raised]
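The reduction from tracking to monitoring can be sketched as follows. This is an idealized illustration that assumes each threshold (1+ε)^j has an exact monitor; the helper names `tracked_value` and `num_thresholds` are hypothetical. The tracked value is the largest threshold already reached, so it is within a factor (1+ε) of Fp, and the number of thresholds below a value N is about ln(N)/ε, which is where the 1/ε factor comes from.

```python
import math


def tracked_value(fp, epsilon):
    """Largest threshold (1+eps)^j, j >= 1, that fp has reached (0.0 if none)."""
    base = 1.0 + epsilon
    if fp < base:
        return 0.0
    j = math.floor(math.log(fp) / math.log(base))
    return base ** j


def num_thresholds(fp_max, epsilon):
    """How many thresholds (1+eps)^j, j >= 1, lie at or below fp_max."""
    base = 1.0 + epsilon
    if fp_max < base:
        return 0
    return math.floor(math.log(fp_max) / math.log(base))


estimate = tracked_value(100, 0.1)
monitors = num_thresholds(100, 0.1)
print(estimate, monitors)
```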

Prior Work

Several papers in the database literature

Mostly heuristic based

Bad worst-case bounds, no lower bounds

F1: O(k/ε · log(τ/k)) [SIGMOD’06]; this work: O(k log(1/ε))

F0: Õ(k^2/ε^3) [ICDE’06]; this work: Õ(k/ε^2)

F2: Õ(k^2/ε^4) [VLDB’05]; this work: Õ(k^2/ε + k^(3/2)/ε^3)

Õ() suppresses polylog factors

Continuous vs One-Shot

If there is a continuous monitoring algorithm that communicates X bits, then there is a one-shot algorithm that communicates O(X+k) bits

Our Results

Good news: all continuous bounds (except F2) are close to their one-shot counterparts

Bad news: all continuous bounds (except F2) are close to their one-shot counterparts

Talk Outline

Introduction

Deterministic F1 algorithm: O(k log(1/ε))

Randomized F1 algorithm: O(1/ε^2 ∙ log(1/δ))

Randomized F0 algorithm: Õ(k/ε^2)

Randomized F2 algorithm: Õ(k^2/ε + k^(3/2)/ε^3)

Conclusions

Deterministic F1 Algorithm

The first round:

[Figure: k sites around a coordinator; each site signals the coordinator after every τ/(2k) new items]

Terminates round after receiving k signals

τ/(2k) · k = τ/2 < F1 < τ


The second round:

[Figure: each site now signals the coordinator after every τ/(4k) new items]

Terminates round after receiving k signals

3τ/4 < F1 < τ


Each round communicates O(k) bits

Continue until Δ = ετ: O(log(1/ε)) rounds

[Figure: in the last round the remaining slack is Δ = ετ]

After the last round, we have (1 − ε)τ < F1 < τ

Total communication: O(k log(1/ε)); lower bound: Ω(k log(1/(εk)))

One-shot: O(k log(1/ε)); lower bound: Ω(k log(1/(εk)))
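The round structure above can be simulated in a few lines. This is a minimal single-process sketch, not the authors' implementation: the function name `monitor_f1` is hypothetical, unreported counts are assumed to carry over from one round to the next, and only the one-bit signals from the sites are counted (the coordinator's end-of-round broadcast adds another O(k) bits per round). Exact fractions are used so that the thresholds τ/(2k), τ/(4k), … are not affected by rounding.

```python
from fractions import Fraction
import random


def monitor_f1(arrivals, k, tau, epsilon):
    """Simulate the round-based F1 monitor.

    arrivals: iterable of site indices in range(k), one per arriving item.
    Returns (alarm_time, signals): the number of items seen when the alarm
    is raised (None if it never is) and the number of signals sent by sites.
    """
    slack = Fraction(tau) / 2            # tau/2, tau/4, ... one value per round
    threshold = slack / k                # a site signals after this many items
    eps_tau = Fraction(epsilon) * tau    # stop once the slack is at most eps*tau
    unreported = [Fraction(0)] * k       # items each site has not yet signalled
    signals_in_round = 0
    signals = 0

    for t, site in enumerate(arrivals, start=1):
        unreported[site] += 1
        recheck = True
        while recheck:                   # a new round lowers the threshold,
            recheck = False              # so every site must be re-examined
            for s in range(k):
                while unreported[s] >= threshold:
                    unreported[s] -= threshold
                    signals_in_round += 1
                    signals += 1
                    if signals_in_round == k:        # round ends
                        if slack <= eps_tau:
                            return t, signals        # alarm
                        slack /= 2
                        threshold = slack / k
                        signals_in_round = 0
                        recheck = True
    return None, signals


rng = random.Random(0)
arrivals = [rng.randrange(4) for _ in range(2000)]
alarm_time, signals = monitor_f1(arrivals, k=4, tau=1024, epsilon=Fraction(1, 8))
print(alarm_time, signals)
```

With τ = 1024, k = 4 and ε = 1/8 the simulation runs three rounds (thresholds 128, 64, 32), so the sites send 3k = 12 signals, and the alarm is raised while F1 lies between (1 − ε)τ = 896 and τ = 1024.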

F0: # Distinct Items

Lower bound: Any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits

Consider the one-shot case first

Use “sketches”: small-space streaming algorithms

“Combine” the sketches from the k sites

FM sketch [Flajolet and Martin 1985; Alon, Matias, and Szegedy, 1999]

FM Sketch

Take a pair-wise independent random hash function h : {1,…,n} → {1,…,2^d}, where 2^d > n

For each incoming element x, compute h(x), e.g., h(5) = 10101100010000

Count how many trailing zeros

Remember the maximum number of trailing zeros in any h(x)

Let Y be the maximum number of trailing zeros

Can show E[2^Y] = # distinct elements

FM Sketch

So 2^Y is an unbiased estimator for # distinct elements

However, it has a large variance

Some recent techniques [Gibbons and Tirthapura, 2001; Bar-Yossef, Jayram, Kumar, Sivakumar, and Trevisan, 2002] produce a good estimator that is within relative error ε with probability 1 − δ

Space increased to Õ(1/ε^2)

FM sketch has linearity: Y1 from A, Y2 from B, then 2^max{Y1, Y2} estimates # distinct items in A ∪ B

A one-shot algorithm with communication Õ(k/ε^2)
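A single-copy FM sketch and its combine step might look like the sketch below. The class name `FMSketch`, the hash h(x) = (a·x + b) mod p with the Mersenne prime p = 2^61 − 1, and the example streams are illustrative assumptions rather than the construction used in the talk, and a single copy gives only a rough estimate. The point shown is the combine step: taking the maximum of the two Y values gives exactly the sketch of the union.

```python
import random

MERSENNE_PRIME = (1 << 61) - 1   # hash modulus; an illustrative choice


def trailing_zeros(v, width=61):
    """Number of trailing zero bits of v (width if v == 0)."""
    return width if v == 0 else (v & -v).bit_length() - 1


class FMSketch:
    def __init__(self, a, b):
        self.a, self.b = a, b    # hash parameters, shared by all sites
        self.y = 0               # max number of trailing zeros seen so far

    def add(self, x):
        """Insert x; return True if Y increased."""
        tz = trailing_zeros((self.a * x + self.b) % MERSENNE_PRIME)
        if tz > self.y:
            self.y = tz
            return True
        return False

    def merge(self, other):
        """Combine with a sketch built from the same hash function."""
        self.y = max(self.y, other.y)

    def estimate(self):
        return 2 ** self.y


rng = random.Random(1)
a, b = rng.randrange(1, MERSENNE_PRIME), rng.randrange(MERSENNE_PRIME)
site_a, site_b, union = FMSketch(a, b), FMSketch(a, b), FMSketch(a, b)
for x in [1, 4, 2, 1, 3, 4, 5]:
    site_a.add(x)
    union.add(x)
for x in [2, 3, 5, 2, 1, 2]:
    site_b.add(x)
    union.add(x)
site_a.merge(site_b)             # now equal to the sketch of the union
print(site_a.y, union.y)
```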

Continuously Monitoring F0

FM sketch is monotone: Yi is non-decreasing, and Yi < log n

Whenever Yi increases, notify the coordinator

The coordinator can always have the up-to-date combined FM sketch

Total communication: Õ(k/ε^2)

Lower bound: Ω(k)
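The notify-on-increase rule can be simulated as follows for one copy of the sketch. The function name `monitor_f0`, the hash (a·x + b) mod (2^61 − 1) and the random workload are assumptions made for illustration. Because each Yi only grows and is bounded by the hash width, each site sends at most that many messages per copy; the full algorithm would run Õ(1/ε^2) copies.

```python
import random

P = (1 << 61) - 1                # hash modulus; an illustrative choice


def trailing_zeros(v, width=61):
    """Number of trailing zero bits of v (width if v == 0)."""
    return width if v == 0 else (v & -v).bit_length() - 1


def monitor_f0(arrivals, k, a, b):
    """arrivals: iterable of (site, item) pairs.

    Returns (coordinator_y, messages): the combined Y held by the
    coordinator and the number of notifications sent by the sites.
    """
    local_y = [0] * k
    coordinator_y = 0
    messages = 0
    for site, item in arrivals:
        y = trailing_zeros((a * item + b) % P)
        if y > local_y[site]:            # Y_i increased: notify coordinator
            local_y[site] = y
            messages += 1
            coordinator_y = max(coordinator_y, y)
    return coordinator_y, messages


rng = random.Random(3)
a, b = rng.randrange(1, P), rng.randrange(P)
k = 4
arrivals = [(rng.randrange(k), rng.randrange(1, 1001)) for _ in range(5000)]
coordinator_y, messages = monitor_f0(arrivals, k, a, b)
print(coordinator_y, messages)
```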

F2: The One-Shot Case

Lower bound: Any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits

Consider the one-shot case first

Use “sketches”: small-space streaming algorithms

“Combine” the sketches from the k sites

AMS sketch [Alon, Matias, and Szegedy, 1999]

AMS Sketch: “Tug-of-War”

Take a 4-wise independent random hash function h : {1,…,n} → {−1,+1}

Compute Y = ∑ h(x) over all x

Y^2 is an unbiased estimator for F2

Use O(1/ε^2 ∙ log(1/δ)) copies to guarantee a good estimator that is within relative error ε with probability 1 − δ

Linearity still holds!

One-shot case can be solved with communication Õ(k/ε^2)
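The tug-of-war sketch and its linearity can be illustrated with the minimal sketch below. Several choices here are assumptions for illustration: the class name `AMSSketch`, deriving the ±1 sign from the parity of a random cubic polynomial modulo 2^61 − 1 (a common way to approximate 4-wise independence), and averaging the copies instead of taking a median of means.

```python
import random

P = (1 << 61) - 1                # field size for the polynomial hash


class AMSSketch:
    def __init__(self, coeffs):
        self.coeffs = coeffs             # one (c0, c1, c2, c3) tuple per copy
        self.y = [0] * len(coeffs)       # one counter Y per copy

    @staticmethod
    def _sign(c, x):
        c0, c1, c2, c3 = c
        v = (((c3 * x + c2) * x + c1) * x + c0) % P   # cubic polynomial in x
        return 1 if v & 1 else -1

    def add(self, x):
        for j, c in enumerate(self.coeffs):
            self.y[j] += self._sign(c, x)

    def merge(self, other):
        """Linearity: the sketch of A ∪ B is the sum of the two sketches."""
        merged = AMSSketch(self.coeffs)
        merged.y = [u + v for u, v in zip(self.y, other.y)]
        return merged

    def estimate(self):
        return sum(v * v for v in self.y) / len(self.y)


rng = random.Random(7)
coeffs = [tuple(rng.randrange(P) for _ in range(4)) for _ in range(16)]
site_a, site_b, union = AMSSketch(coeffs), AMSSketch(coeffs), AMSSketch(coeffs)
for x in [1, 4, 2, 1, 3, 4, 5]:
    site_a.add(x)
    union.add(x)
for x in [2, 3, 5, 2, 1, 2]:
    site_b.add(x)
    union.add(x)
merged = site_a.merge(site_b)

single = AMSSketch(coeffs)
for _ in range(10):
    single.add(7)                # one item with m = 10, so F2 = 100
print(merged.estimate(), single.estimate())
```

For a stream containing a single distinct item, every copy has Y = ±m, so the estimate equals F2 = m^2 exactly; for general streams the estimate is only approximate.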

However…

Y is not monotone!

Can’t afford to send all changes of the local sketch to the coordinator

F2 Monitoring: Multi-Round Algorithm

Beginning of a round

[Figure: each site sends a sketch of size Õ(1/ε^2) to the coordinator, which combines them into an estimate for F2]

During a round

Each site sends a signal whenever the F2 of the updates increases by t = (τ − F2)^2/(64 k^2 τ)

End of a round: when k signals are received

old F2 + (τ − old F2) ∙ ε/k < new F2 < τ

# rounds: O(k/ε)

Total cost: Õ(k^2/ε^3)

F2: Round / Sub-Round Algorithm

End of a sub-round: when k signals are received

[Figure: each site sends a “rough” sketch of size Õ(1) to the coordinator, which combines the sketches and maintains an upper bound of F2; a label “k” also appears in the figure]

old F2 + (τ − old F2) ∙ ε/k < new F2 < τ

Total cost: Õ(k^2/ε + k^(3/2)/ε^3)

One-shot: Õ(k/ε^2); lower bound: Ω(k)

Open Problems

Still no clear separation between the one-shot model and the continuous model

F2 is an interesting case

Many other functions f

Statistics: entropy, heavy hitters

Geometric measures: diameter, width, …

Variations of the model

One-way vs two-way communication

Does having a broadcast channel help?

Sliding windows?

“Continuous Communication Complexity”?

Thank you!
