Navneet Kumar Pandey¹, Stéphane Weiss¹, Roman Vitenberg¹, Kaiwen Zhang², Hans-Arno Jacobsen²

¹ University of Oslo
² University of Toronto

Minimizing the communication cost of aggregation in Publish/Subscribe systems

Aggregation in Pub/Sub systems

Motivation: Stock Market Application

Content provider: Stock exchanges

Aggregate subscription: Stock indicators (e.g., MACD)

Content subscriber: Brokers, buyers

Non-aggregate subscription: Stock updates

Motivation: Intelligent Transport System (ITS)

• Information providers: road sensors, crowdsourced mobile apps

• Information seekers: commuters, police, first responders, radio networks etc.

http://www.wired.com/images_blogs/autopia/2012/08/12A914.jpg

• Aggregate subscriptions
  – Count the number of cars passing a street light per hour
  – Average speed of cars on a road segment per day

• Non-aggregate subscriptions
  – Accident reports
  – Traffic violation reports

Objective: Aggregation in Pub/Sub

• Pub/sub is well known for efficient content filtering and dissemination for distributed event sources and sinks.

• However, pub/sub does not support aggregation, which is required in emerging applications.

• Our primary objective is to retain the traditional pub/sub focus on low communication cost, while adding support for aggregation.

• Supporting aggregation inside pub/sub is more communication- and computation-efficient than running two separate systems for pub/sub and aggregation.

Contributions: aggregation in pub/sub

• We introduce and formalize the problem of minimizing communication for aggregation in pub/sub.

• By reducing the problem to minimum vertex cover over bipartite graphs, we show that it is solvable in polynomial time.

• We present a solution which is optimal under complete knowledge of publications and subscriptions by a broker.

• We propose an alternative algorithm which is less computationally expensive.

• We evaluate the trade-off between communication and computation costs for these two solutions.

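Background: Advertisement-based pub/sub model

[Figure: publishers and a subscriber connected through brokers (B), including Bp, Bq, BI, and BS, with example messages A[val, >, 4], P[val, 8], and S[val, >, 3] (advertisement, publication, subscription) and the Subscription Delivery Tree (SDT).]

Our design choice: To maximize communication efficiency, we reuse the dissemination flow, i.e., the SDTs.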

Proposal: aggregation in Pub/Sub system

Aggregate Subscription: {<conditional predicates>, operator, duration (ω), shift size (δ)}

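[Figure: timeline from 0 to 3 hours showing publications Pub1, Pub2, and Pub3 and the notification window ranges (NWR) NWR1 and NWR2 of a subscription, annotated with the duration ω and the shift size δ.]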

Example: {RoadID = 101, speed > 10, op = ‘avg’, ω = 2 hours, δ = 1 hour}
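The window semantics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: windows are taken to be half-open intervals [start, end) that begin at time 0 and advance by δ, and the function names and the speed readings are hypothetical, not taken from the slides.

```python
import math

def matching_nwrs(t, duration, shift, origin=0.0):
    """Return the (start, end) windows that contain timestamp t.

    Assumption: windows are half-open [start, end), start at
    origin + k * shift for k = 0, 1, 2, ..., and last `duration` units.
    """
    last = math.floor((t - origin) / shift)
    first = max(0, math.floor((t - origin - duration) / shift) + 1)
    return [(origin + k * shift, origin + k * shift + duration)
            for k in range(first, last + 1)]

def windowed_average(publications, duration, shift):
    """Group (timestamp, value) pairs by NWR and average each group."""
    groups = {}
    for t, value in publications:
        for window in matching_nwrs(t, duration, shift):
            groups.setdefault(window, []).append(value)
    return {w: sum(vs) / len(vs) for w, vs in sorted(groups.items())}

# Hypothetical readings for RoadID = 101 as (hour, speed); all have speed > 10.
readings = [(0.5, 40.0), (1.5, 60.0), (2.5, 20.0)]
averages = windowed_average(readings, duration=2, shift=1)
for (start, end), avg in averages.items():
    print(f"NWR [{start}, {end}): average speed {avg}")
```

With ω = 2 hours and δ = 1 hour, a publication falls into up to ω/δ = 2 windows, which is why one publication can contribute to several results.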

Challenge: Distribute the computation across the brokers

[Figure: a broker on an SDT receives Pub1, Pub2, and Pub3 and makes an aggregation decision between sending publication messages and result messages. Example 1: the subscription has two NWRs (NWR1, NWR2); the publication load is 3 and the result load is 2 (Res1, Res2). Example 2: the subscription has four NWRs (NWR1 to NWR4); the publication load is 3 and the result load is 4 (Res1 to Res4).]

Local aggregation decision by each broker on an SDT for each NWR: Aggregate or forward incoming publications for that NWR.

Trade-off: multiple factors affect the decision

Increasing parameter                                 Favors
Publication matching rate                            Aggregate
Number of matched NWRs                               Forward
Overlap among aggregate subscriptions                Forward
Ratio between aggregate and regular subscriptions    Aggregate

Challenges:
• No global knowledge about the topology.
• SDTs are beyond the control of the aggregation scheme.
• SDTs change dynamically during execution.

Unique challenges compared to other aggregation systems

Aggregation in pub/sub                                             Other aggregation systems
Topology is not known to individual broker nodes                   Require a global view of the topology
Publication sources and sinks are dynamic                          Require a priori knowledge of publication sources
Brokers are loosely coupled                                        Need a control layer
SDTs are dynamic and outside the control of the aggregation scheme Demand a static query plan
Publications come at an irregular rate                             Optimized for continuous data streams

Problem formulation: Minimum-Notification for Aggregation (MNA)

• Objective: Given
  – the set of subscriptions and
  – the set of incoming messages, which includes both publications and previously aggregated results,
  minimize the number of notifications, i.e., publications and aggregation results sent by a broker.

[Figure: matching graph with NWR vertices Na, Nb, Nc and publication vertices P1 to P4 (labeled Np and Pp); an edge denotes the matching of a publication p to an NWR n. A timeline shows P1 to P4 against NWRa, NWRb, and NWRc.]
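One way to write this objective down, assuming the matchings between incoming messages and NWRs are viewed as edges of a bipartite graph. The symbols N, P, E, A, and F are notation introduced here for illustration and do not appear on the slides.

```latex
% N: NWRs, P: incoming publications, E \subseteq N \times P: matchings
% A: NWRs for which the broker sends an aggregated result
% F: publications the broker forwards unchanged
\begin{aligned}
\min_{A \subseteq N,\; F \subseteq P} \quad & |A| + |F| \\
\text{subject to} \quad & n \in A \ \text{ or } \ p \in F
  \qquad \text{for every matching } (n, p) \in E
\end{aligned}
```

The constraint says that every matching must be served either by the aggregated result of its NWR or by the forwarded publication itself.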

Optimal solution

Unrealistic assumption:
• Brokers have information about
  – all the matching publications and
  – all the NWRs within the entire execution.
• This information is available a priori.

Idea:
• Each broker constructs an undirected bipartite graph (the matching graph)
• and computes the minimum vertex cover.

[Figure: matching graph with NWRs Na, Nb, Nc and publications P1 to P4; a minimum vertex cover is {Na, Nb, Nc}.]
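A minimal sketch of this idea, assuming the matching graph is given as a dictionary from each NWR to the publications it matches. The edges in the example are hypothetical, since the slide's exact graph is not recoverable here. The algorithm is the textbook combination of augmenting-path bipartite matching and König's construction, not necessarily the authors' implementation.

```python
def min_vertex_cover(adj):
    """Return (nwrs_to_aggregate, pubs_to_forward) forming a minimum vertex cover.

    adj maps each NWR to the list of publications it matches.
    """
    match_pub = {}  # publication -> NWR it is matched to
    match_nwr = {}  # NWR -> publication it is matched to

    def augment(n, seen):
        # Try to match NWR n, re-routing earlier matches if necessary.
        for p in adj[n]:
            if p in seen:
                continue
            seen.add(p)
            if p not in match_pub or augment(match_pub[p], seen):
                match_pub[p] = n
                match_nwr[n] = p
                return True
        return False

    for n in adj:
        augment(n, set())

    # Koenig's theorem: explore alternating paths from unmatched NWRs.
    visited_n, visited_p = set(), set()
    stack = [n for n in adj if n not in match_nwr]
    while stack:
        n = stack.pop()
        if n in visited_n:
            continue
        visited_n.add(n)
        for p in adj[n]:
            if p not in visited_p:
                visited_p.add(p)
                if p in match_pub:
                    stack.append(match_pub[p])

    # Cover = unvisited NWRs (send their results) + visited publications (forward).
    return set(adj) - visited_n, visited_p

# Hypothetical matching graph (edges are an assumption for illustration).
graph = {
    "Na": ["P1", "P2"],
    "Nb": ["P1", "P2", "P3", "P4"],
    "Nc": ["P1", "P2", "P3", "P4"],
}
aggregate, forward = min_vertex_cover(graph)
print(sorted(aggregate), sorted(forward))
```

For this graph the cover consists of the three NWR vertices, so sending three results is cheaper than forwarding four publications. If a single publication matched three NWRs instead, the cover would be that one publication.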

• Computation cost:

Practical solution: Aggregation Decision, Optimal with Complete Knowledge (ADOCK)

• Idea: Make decisions with partial knowledge, based on the current state of publications, NWRs, and their interconnectivity.

• Implication: suboptimal decisions.

• Communication cost: Close to the optimal solution in the experiments.

#Subscriptions                              90      180     270     360
Difference between optimal and ADOCK (%)    3.53    0.88    4.29    3.27

|N|: number of NWRs, |P|: number of publications, degA(N): average degree of NWR vertices

Scalability issue: Computation cost grows more than quadratically with the number of NWRs.

[Figure: matching graph with NWRs Na, Nb, Nc and publications P1 to P4.]

Practical solution: Weighted Aggregation Decision (WAD)

Idea:
• Reduce the number of vertices used for making a decision.
• Only vertices within 2 hops of the NWR affect the decision.
• Similar to ADOCK, take a decision per NWR.

Steps:
1. Assign each publication vertex a weight equal to the inverse of its degree.
2. Compute the cumulative weight of an NWR from its matching publications.
3. Aggregate the matching publications if the cumulative weight is ≥ 1; otherwise forward them.

[Figure: matching graph with NWRs Na, Nb, Nc and publications P1 to P4, whose weights are 1/3, 1/3, 1/2, and 1/2. Na has cumulative weight 2/3 < 1 and is forwarded; Nb and Nc have cumulative weight ≥ 1 and are aggregated. Resulting notifications: P1, P2, Nb, Nc.]

Low computation cost: O(degA(N) × |N|)
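The three steps can be sketched as follows. This is an illustrative reading of the slide rather than the authors' code. The edges of the example graph are an assumption, chosen so that the publication weights come out as 1/3, 1/3, 1/2, and 1/2 as shown on the slide, and the function name is hypothetical.

```python
from fractions import Fraction

def wad_decisions(adj):
    """Map each NWR to 'aggregate' or 'forward' using the weighted rule.

    adj maps each NWR to the list of publications it matches.
    """
    # Step 1: weight of a publication = 1 / (number of NWRs it matches).
    degree = {}
    for pubs in adj.values():
        for p in pubs:
            degree[p] = degree.get(p, 0) + 1
    weight = {p: Fraction(1, d) for p, d in degree.items()}

    decisions = {}
    for n, pubs in adj.items():
        # Step 2: cumulative weight over the NWR's matching publications.
        cumulative = sum((weight[p] for p in pubs), Fraction(0))
        # Step 3: aggregate when the cumulative weight reaches 1.
        decisions[n] = "aggregate" if cumulative >= 1 else "forward"
    return decisions

# Hypothetical matching graph (edges are an assumption for illustration).
graph = {
    "Na": ["P1", "P2"],
    "Nb": ["P1", "P2", "P3", "P4"],
    "Nc": ["P1", "P2", "P3", "P4"],
}
decisions = wad_decisions(graph)

# Messages sent: one result per aggregated NWR plus every publication
# that some forwarded NWR still needs.
notifications = {n for n, d in decisions.items() if d == "aggregate"}
for n, d in decisions.items():
    if d == "forward":
        notifications.update(graph[n])
print(decisions, sorted(notifications))
```

Here Na accumulates 1/3 + 1/3 = 2/3 and is forwarded, while Nb and Nc each accumulate 5/3 and are aggregated, giving four notifications. Each decision only looks at an NWR's publications and their degrees, which is where the low computation cost comes from.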

Experimental setup

• Implemented in Java over the PADRES framework
• Topology: 16 brokers
  – Combination of publisher-edge only, subscriber-edge only, and mixed brokers
• Real-life datasets:
  – Traffic dataset from the ONE-ITS service¹
  – Yahoo! Finance stock dataset
• Metrics:
  – Communication: number of messages exchanged
  – Computation: total computation overhead
• Existing baseline: per Broker Adaptive Technique (BAT)

[Figure: topology of 16 brokers (B).]

¹ http://one-its-webapp1.transport.utoronto.ca

Varying number of publications

Setting: Stock dataset

[Figure: two panels, computation cost and communication cost.]

• There is a trade-off between WAD and ADOCK over communication and computation cost.
• WAD is up to 73% faster than ADOCK at the expense of an up to 22% increase in communication cost.
• BAT sends more messages than either of the proposed solutions.

Varying number of subscriptions

[Figure: two panels, computation cost and communication cost, for ADOCK and WAD.]

• ADOCK’s communication cost is around 12% lower than WAD’s.
• However, ADOCK’s computation overhead is more than twice that of WAD.
• This is also supported by the analytical findings.

Impact of sliding windows

• A higher sliding parameter increases the NWR interconnectivity and enlarges the decision graph.

• ADOCK is up to four times slower than WAD, while WAD sends up to 27% more messages.

[Figure: CPU overhead, (ADOCK-WAD)/WAD, and message overhead, (WAD-ADOCK)/ADOCK, plotted as a ratio (0% to 350%) against the duration (ω) to shift size (δ) ratio (1 to 6).]

Key lessons from experiments

• Our results confirm that interconnectivity is the key reason for the trade-off between computation and communication cost.
• The trade-off is substantially affected by these factors:
  – Publication matching rate
  – Number of matching NWRs
  – Overlap among aggregate subscriptions
  – Ratio between aggregate and regular subscriptions
• Recommendation:
  – ADOCK is preferred if the system expects a moderate number of subscriptions with high selectivity.
  – Otherwise, WAD is recommended.

Conclusions

• We formalize the MNA problem and reduce it to Minimum Vertex Cover over a bipartite graph.

• We provide two solutions: the communication-efficient ADOCK and the computation-efficient WAD.

• We experimentally demonstrate the trade-off between computation and communication cost in these approaches.

Thank you!

For questions & comments

[email protected]