Reliable and Highly Available Distributed Publish/Subscribe Systems
Transcript of Reliable and Highly Available Distributed Publish/Subscribe Systems
Reliable and Highly Available Distributed Publish/Subscribe Systems
Reza Sherafat, Hans-Arno Jacobsen
University of Toronto
Symposium on Reliable Distributed Systems (SRDS), September 2009
SRDS'09 2
Distributed Publish/Subscribe Systems
- Many-to-many communication
- High-level operations: "subscribe" and "publish"
- Decoupling between sources and sinks
- Flexible content-based messaging
[Figure: publishers (P) and subscribers (S) exchanging messages through the pub/sub overlay]
Agenda
- Existing approaches
- δ-Fault-tolerance
- Architecture
- Reliable publication delivery protocol
- Experimental results
Store-and-Forward
- A copy is first preserved on disk and then forwarded
- Intermediate hops send an ACK to the previous hop after preserving
- ACKed copies can be discarded from disk
- Upon failures, unacknowledged copies survive and are re-transmitted after recovery
  - This ensures reliable delivery but may cause delays while the machine is down
[Figure: a publication forwarded hop by hop from source to destination, with ACKs flowing back]
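The store-and-forward steps above can be sketched in Java as a minimal in-memory model (illustrative names, not the paper's code; the "disk" is a map, and the last hop stands in for delivery to the subscriber):

```java
import java.util.LinkedHashMap;
import java.util.Map;

class StoreAndForwardHop {
    final String name;
    // Unacknowledged copies, standing in for the on-disk store.
    final Map<Integer, String> disk = new LinkedHashMap<>();
    StoreAndForwardHop next;

    StoreAndForwardHop(String name) { this.name = name; }

    // Persist first, then ack the previous hop, then forward downstream.
    void receiveFrom(StoreAndForwardHop prev, int seq, String pub) {
        disk.put(seq, pub);                 // 1. preserve a copy on disk
        if (prev != null) prev.onAck(seq);  // 2. ACK after preserving
        if (next != null) next.receiveFrom(this, seq, pub); // 3. forward
    }

    // Downstream hop persisted the copy, so ours can be discarded.
    void onAck(int seq) { disk.remove(seq); }

    // After recovering from a crash, copies that were never acknowledged
    // survive on disk and are re-transmitted.
    void retransmitUnacked() {
        for (Map.Entry<Integer, String> e : new LinkedHashMap<>(disk).entrySet())
            if (next != null) next.receiveFrom(this, e.getKey(), e.getValue());
    }
}
```

After a publication traverses a chain of hops, every hop except the last has been ACKed and has discarded its copy; the delay noted on the slide comes from waiting for a crashed hop to recover and retransmit.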
Mesh-Based Overlay Networks [Snoeren et al., SOSP 2001]
- Use a mesh network to concurrently forward messages on disjoint paths
- Upon failures, the message is delivered using alternative routes
- Pros: minimal impact on delivery delay
- Cons: imposes additional traffic and the possibility of duplicate delivery
[Figure: a publication forwarded from source to destination along multiple disjoint paths]
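Because disjoint-path forwarding can deliver the same message more than once, receivers typically suppress duplicates by message id; a minimal sketch of that idea (an assumption for illustration, not from the cited paper):

```java
import java.util.HashSet;
import java.util.Set;

class DuplicateFilter {
    private final Set<Long> seen = new HashSet<>();

    // Returns true the first time a message id arrives, false for
    // copies that raced in over alternative disjoint paths.
    boolean deliver(long msgId) {
        return seen.add(msgId);
    }
}
```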
Replica-based Approach [Bhola et al., DSN 2002]
- Replicas are grouped into virtual nodes
- Replicas have identical routing information
- We compare against this approach in the evaluation section
[Figure: brokers grouped into virtual nodes]
Next: δ-Fault-Tolerance
δ-Fault-Tolerance
- In a distributed messaging system:
  - Failed brokers may be down for a long time
  - There are often concurrent failures
  - Reliable message delivery is essential
- Configuration parameter δ
- A δ-fault-tolerant pub/sub system ensures reliable delivery when there are up to δ concurrent crash failures
- Reliability:
  - Exactly-once delivery of publications to matching subscribers
  - Per-source FIFO-ordered message delivery
Next: Architecture
Architecture
- Brokers are organized in a tree-based overlay network
- In our approach, δ-fault-tolerance is closely related to how much brokers know about the broker tree
- (δ+1)-neighborhood: the brokers within distance δ+1
- This information is stored in a data structure called the topology map
  - Topology maps are updated as brokers enter/leave the network
[Figure: nested 1-, 2-, and 3-neighborhoods around a broker in the tree]
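The (δ+1)-neighborhood a broker keeps in its topology map can be computed by a breadth-first search over the tree overlay; a sketch with hypothetical types (the adjacency map stands in for the broker tree):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TopologyMap {
    // Returns all brokers within `depth` hops of `root`, excluding root.
    // For a δ-fault-tolerant broker, call with depth = δ + 1.
    static Set<String> neighborhood(Map<String, List<String>> tree,
                                    String root, int depth) {
        Set<String> seen = new HashSet<>();
        seen.add(root);
        List<String> frontier = List.of(root);
        for (int d = 0; d < depth; d++) {
            List<String> next = new ArrayList<>();
            for (String b : frontier)
                for (String nb : tree.getOrDefault(b, List.of()))
                    if (seen.add(nb)) next.add(nb); // first time seen
            frontier = next;
        }
        seen.remove(root);
        return seen;
    }
}
```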
Join Algorithm
1. Joining broker connects to a joinpoint
2. A joinRequest message is sent to the joinpoint
3. The joinpoint replies with a subset of its topology map
4. The joinRequest is propagated in the network
5. Receiving brokers update their topology maps
6. Confirmation messages propagated from edge brokers are sent back
7. Joining broker receives the confirmation: the join is complete
[Figure: the joining broker attaches at the joinpoint; its δ- and (δ+1)-neighborhoods are highlighted]
Subscription Routing Information
- The subscription routing protocol is used to construct forwarding paths
- Subscription messages encapsulate:
  - pred: conjunctive predicates specifying the client's interests
  - from: a BrokerID pointing back to the broker δ+1 hops closer to the subscriber
- Subscriptions are sent hop-by-hop throughout the network
  - Brokers update from as the message is forwarded
  - Brokers handle confirmation messages similarly to join
  - Confirmed subscriptions are inserted into the subscription routing table
[Figure: brokers A, B, C, D, E in a chain with δ=2; each broker's s.from points δ+1 hops back toward the subscriber]
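The from-field bookkeeping above can be sketched in Java (illustrative code, assuming path[0] is the subscriber's edge broker): each broker on the forwarding path records the broker δ+1 hops closer to the subscriber, clamped at the edge broker, so a forwarder can later bypass up to δ consecutive failed downstream brokers.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SubscriptionRouting {
    // path[0] is the subscriber's edge broker; returns broker -> from,
    // the broker δ+1 hops closer to the subscriber at each position.
    static Map<String, String> fromFields(List<String> path, int delta) {
        Map<String, String> from = new LinkedHashMap<>();
        for (int i = 0; i < path.size(); i++) {
            // δ+1 hops back, clamped at the subscriber's edge broker
            int back = Math.max(0, i - (delta + 1));
            from.put(path.get(i), path.get(back));
        }
        return from;
    }
}
```

For the chain A-B-C-D-E with δ=2, broker E's from points at B, three hops back; brokers closer than δ+1 hops to the subscriber point at the edge broker A itself.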
Next: Reliable publication forwarding protocols
Publication Forwarding Algorithm (No-Failure Case)
1. Received publications are placed in a FIFO message queue and kept until processing is complete
2. Using subscription info, subscriptions matching the publication are identified
3. Matching subscriptions' from fields are inserted into the recipientSet
4. Using the topology map, the publication is sent to the closest available brokers towards matching subscribers (the outgoingSet)
5. Receiving downstream brokers similarly forward the publication until it is delivered to subscribers
6. Confirmations from all downstream brokers are received
7. Clean-up: once all confirmations arrive, the publication is discarded from the queue
[Figure: broker A with its message queue and (δ+1)-neighborhood; publications flow downstream, confirmations flow upstream]
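Steps 2-3 above, matching a publication and building the recipientSet, might look like this in Java; the Subscription type and the predicate representation are assumptions for illustration, not the paper's data model:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

class Forwarder {
    // A stored subscription: its matching predicate and its from field,
    // the broker δ+1 hops closer to the subscriber.
    record Subscription(Predicate<Map<String, Object>> pred, String from) {}

    // A publication is modeled as attribute -> value pairs.
    static Set<String> recipientSet(Map<String, Object> pub,
                                    List<Subscription> subs) {
        Set<String> recipients = new LinkedHashSet<>();
        for (Subscription s : subs)
            if (s.pred().test(pub))       // content-based match
                recipients.add(s.from()); // collect the from fields
        return recipients;
    }
}
```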
Publication Forwarding Algorithm (Failure Case)
- Brokers use heartbeats to monitor the availability of their connected peers
- Once failures are detected, the broker reconnects the topology by creating new links to the downstream neighbors of the failed brokers
- Unconfirmed publications are re-transmitted from the message queue
- Subsequent publications are forwarded via the new links instead
  - Bypass failed brokers
- Multiple concurrent failures (up to δ) are handled similarly
  - In the worst case, δ brokers have failed in a row
[Figure: broker A re-transmits queued publications over new links that bypass the failed downstream brokers]
Eliminating the Need for Confirmation Messages
- For each publication message sent over a link, a confirmation message is sent back
  - Increased network traffic
- We use an aggregated acknowledgement mechanism called Depth Acknowledgements (DACK)
  - It is very similar to the normal way that …
  - This removes the need for per-publication confirmation messages
Discarding Publications Using DACK Messages
- B and C keep track of the highest sequence number they received and discarded (prefix-based) from A, and periodically report it upstream using DACK messages
- Brokers append their own information to DACKs and also relay portions of their neighbors' DACK messages
- For each publication, A evaluates safety conditions for all brokers in the publication's recipientSet
- Safety conditions:
  - All intermediate brokers report an arrived seq# that is higher than the publication's seq#, OR
  - Some intermediate broker has reported a discarded prefix seq# that is higher than the publication's seq# (necessary when there are failures)
[Figure: DACK messages flow upstream from C through B to A, each carrying per-broker arrived/discarded sequence numbers; A updates its records and decides whether a queued publication can be discarded]
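The two safety conditions can be sketched as a predicate over the DACK reports that A has collected for the intermediate brokers on one path toward a recipient (an illustrative data model, not the paper's code; ≥ is used because the reported sequence numbers are prefix-based, so covering the publication's own seq# suffices):

```java
import java.util.List;

class Dack {
    // arrived:   highest prefix-received seq# reported per intermediate broker
    // discarded: highest prefix-discarded seq# reported per intermediate broker
    static boolean safeToDiscard(int seq,
                                 List<Integer> arrived,
                                 List<Integer> discarded) {
        // Condition 2: some broker discarded a prefix covering this
        // publication (needed when downstream brokers have failed).
        for (int d : discarded)
            if (d >= seq) return true;
        // Condition 1: every intermediate broker has received it.
        for (int a : arrived)
            if (a < seq) return false;
        return true;
    }
}
```

Once the condition holds for every broker in the publication's recipientSet, A can purge the publication from its message queue without ever having received per-publication confirmations.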
Next: Experimental results
Experimental Setup
- Algorithms implemented in Java
- We run the system on a cluster:
  - 21 nodes, each with 4 cores
  - Gigabit Ethernet
- Topology setup (δ=3):
  - 83 brokers
  - 2600 subscriptions
  - 26 publishers at varied publication rates
- We inject failures at R1, R2, R3 and perform measurements
[Figure: the broker tree with failure-injection points R1, R2, R3]
Publication Delivery Delay
- Impact of failures on publication delivery delay
  - Use a stream of publications (10 msg/s)
  - Measure the delivery delay between publishing and subscribing endpoints
- 3 separate runs with different numbers of simultaneous failures
- After a short-lived jump, the delivery delay quickly returns to normal
  - The difference corresponds to the failure-detection timeout
[Figure: delivery delay over time for the 1-failure, 2-failure, and 3-failure runs]
Change in Load After Failures
- Non-faulty brokers' load after failures:
  - Input message traffic: no change!
  - Output message traffic: increase
  - CPU utilization: increase
- Output rate and CPU utilization are affected by nearby failures
[Figure: input message rate, output message rate, and CPU load at R1 and R2 when R3 fails. Spikes appear at R2 after brokers reconnect, with smaller spikes at R1; R2's output traffic and CPU load stabilize at slightly higher levels, its input traffic stabilizes at exactly the same rate, and R1 sees no change]
Comparison with Replica-based Approach
- Topology setup:
  - Our approach: δ=2
  - Replica-based: 2 replicas
  - Considered the situation after 2 failures (R2 and R3 fail)
- Compared the load on R1 after the failures occur
- In our approach, the CPU load on R1 is about 30% lower
[Figure: CPU load on R1 in our approach vs. the replica-based approach, showing a 30% difference]
Conclusions
- Our system delivers reliable pub/sub service in the face of up to δ concurrent broker failures
- We also proposed optimizations:
  - Aggregated acknowledgement messages (DACKs) to reduce network traffic
- Ongoing and future work:
  - Explore multi-path forwarding
- http://research.msrg.utoronto.ca/Padres/WebHome
Thanks!
Questions?
Backup slides …
Sample DACK Propagation and Publication Purging (δ=3)
[Animation: step-by-step frames illustrating the first and second safety conditions. Legend: node holds pub in MQ; node discards pub; node receives pub; direction of pub forwarding]
Publication Propagation and Purging: Using DACK Info (δ=3)
[Animation: step-by-step frames]
Publication Propagation and Purging: Using DACK Info with Failures (δ=3)