Pushing group communication to the edge will enable radically new distributed
applications
Ken Birman
Cornell University
First, a premise
An industry driven by disruptive changes
– Incremental advances are fine, but don’t shake things up very much
– Those who quickly seize and fully exploit disruptive technologies thrive, while those who miss even a single major event are left behind
Our challenge as researchers?
– Notice these opportunities first
Ingredients for disruptive change
Set the stage:
– Pent-up opportunity to leverage the legacy app base
– Infrastructure enablers suddenly in place
– Potential for radically new / killer applications
Some sort of core problem to solve
– Often demands out-of-the-box innovation
Deployable in an easily used form
– These days, developers demand integrated tools
Why group communication?
Or more specifically…
– Why groups as opposed, say, to pub-sub?
– Why does the edge of the network represent such a big opportunity? Who would use it, and why?
Can software challenges (finally) be overcome?
How to integrate with existing platforms?
Groups: A common denominator
Not a new idea… dates back 20 years or more
– V system (Stanford) and Isis system (Cornell)
A group is a natural distributed abstraction
– Can represent replicated data or services, shared keys or other consistent state, leadership or other coordination abstractions, shared memory…
– These days we would visualize a group as an object
  Can store a pointer to it in the file system name space
  Each has an associated type (“endpoint management class”)
A process opens as many groups as it likes (like files); a sketch of what such an API might look like follows
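To make the “groups as files” idea concrete, here is a minimal C# sketch of what a group-as-object API might look like. Everything in it (IGroup, Groups.Open, the event names) is invented for illustration rather than taken from QSM, and the toy implementation only loops messages back within one process:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical "group as object" API: a process opens groups by name,
// much as it opens files. IGroup, Groups.Open and the event names are
// invented for illustration; this toy version only loops a message back
// to subscribers within the same process.
public interface IGroup : IDisposable
{
    string Name { get; }
    void Send(string message);             // multicast to all members
    event Action<string> MessageReceived;  // delivery upcall
}

public static class Groups
{
    static readonly Dictionary<string, LocalGroup> openGroups =
        new Dictionary<string, LocalGroup>();

    // A process can open as many groups as it likes, like files.
    public static IGroup Open(string name)
    {
        if (!openGroups.TryGetValue(name, out var g))
            openGroups[name] = g = new LocalGroup(name);
        return g;
    }

    sealed class LocalGroup : IGroup
    {
        public LocalGroup(string name) { Name = name; }
        public string Name { get; }
        public event Action<string> MessageReceived;
        public void Send(string message) => MessageReceived?.Invoke(message);
        public void Dispose() { }
    }
}

class Example
{
    static void Main()
    {
        using (var quotes = Groups.Open("/groups/stocks/IBM"))
        {
            quotes.MessageReceived += q => Console.WriteLine("quote: " + q);
            quotes.Send("IBM 81.25");   // every subscriber's upcall fires
        }
    }
}
```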
A very short history of groups
The V system offered O/S-level groups, but stalled
Isis added “strong semantics” and also many end-user presentations, like pub-sub
– Strong semantics: the group is indistinguishable from a single entity implementing the same abstraction
– Our name for this model: virtual synchrony
  Can be understood as a form of transactional serializability, but with processes, groups, and messages as the components
  Weaker than Paxos, which uses a strong consensus model
Pub-sub emerged as the most popular end-user API
Virtual Synchrony Model
[Figure: timeline of processes p, q, r, s, t. The group passes through views G0={p,q}, G1={p,q,r,s} (r and s request to join and are added, with state transfer), G2={q,r,s} (p crashes), and G3={q,r,s,t} (t requests to join and is added, with state transfer).]
… to date, the only widely adopted model for consistency and fault-tolerance in highly available networked applications
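The view sequence in the figure can be replayed as a tiny simulation: membership events (joins, failures) are applied in one agreed order, and every surviving member installs the same sequence of views. A hedged sketch in C#, with invented types rather than the real Isis/QSM API:

```csharp
using System;
using System.Collections.Generic;

// Replays the view sequence from the figure: membership events are
// applied in one agreed order, and every surviving member installs the
// same sequence of views G0, G1, ... Types are invented for illustration.
class ViewSequence
{
    static readonly SortedSet<string> members =
        new SortedSet<string>(StringComparer.Ordinal) { "p", "q" };
    static int viewId;

    static void Install(string why) =>
        Console.WriteLine($"G{viewId++} = {{{string.Join(",", members)}}}  ({why})");

    static void Main()
    {
        Install("initial view");
        members.Add("r"); members.Add("s");
        Install("r, s added; state transfer to the joiners");
        members.Remove("p");
        Install("p failed");
        members.Add("t");
        Install("t added; state transfer");
    }
}
```

Running it prints exactly the views G0={p,q} through G3={q,r,s,t} shown above.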
Four models side-by-side
Traditional pub-sub as supported in current products
– No guarantees; often scales poorly (1-to-1 TCP connections)
– Versions that use IP multicast are prone to meltdown when stressed
Virtual synchrony
– Fastest and most scalable of the “strongly consistent” models
– Cheats on apparent synchrony whenever it can, reducing the risk of a correlated failure caused by a poison-pill multicast
Paxos
– Closest fit to “consensus” agreement and fault-tolerance properties
– Performance limited by the 2-phase protocol required to achieve this
– Virtual synchrony uses a Paxos-like protocol to track group membership
Transactions
– Most expensive of all: updates touch persistent storage
– Execution model too constraining for high-speed communication apps
Successes… and failures
Some successes:
– New York Stock Exchange, Swiss Exchange, French Air Traffic Control System, US Navy AEGIS, telephony, factory automation, Microsoft Vista clusters, IBM Websphere
– Paxos popular for small fault-tolerant services
Some failures:
– Implementations were often fragile, didn’t scale well, and were poorly integrated with development environments
– Pub-sub users tolerated weak semantics
Bottom line: the market didn’t scale adequately
Why didn’t the market scale?
Mile-high perspective:
– All existing group communication solutions targeted server platforms, e.g. to replicate data in a clustered application
– Pub-sub became the majority solution for sending data (for example, stock trades) from data centers to client platforms
But neither market was ultimately all that large
– The number of server platforms is tiny compared to the number of client systems… and group communication isn’t the whole story – you always needed “more” technology
– Meanwhile, once you license pub-sub to every major trading floor, you exhaust the associated revenue opportunity
Why didn’t either displace the other?
Pub-sub systems were best-effort technologies
– They compete with systems that just make lots of TCP connections and push data…
– Lacking stronger semantics, applications that wanted security or stronger reliability had to build extra end-to-end logic
But group communication systems had scalability issues of their own
– Most platforms focus on processes using a small number of small groups (often just one group of 3-5 members)
– Other positionings involve relaying through a central service
Is there an answer?
Retarget group communication towards the edge of the network!
– Provide it as a direct client-to-client option
– Integrate it tightly into the dominant client computing platforms (.NET, web services)
– Make it scale much better
Now value is tied to the number of client systems, not the number of servers…
Potential roles for group communication at the edge of the net?
Gaming systems and VR immersion
Delivery of streaming media, stock quotes, projected pricing for stocks and bonds…
Replication of security keys and other system-structuring data and management information
Vision: any client can securely produce and/or consume data streams… servers are present, but in a “supporting role”
Recall our list of ingredients…
Pent-up demand: developers have lacked a way to do this for decades…
Technology enabler: Availability of ubiquitous broadband connectivity, high bandwidths
Potential for high-value use in legacy apps and potential for new killer apps
... But can we overcome scalability limits?
Quicksilver: Krzys Ostrowski, Birman, Phanishayee, Dolev
Publish-subscribe eventing and notification
Scalable in many dimensions
– Number of publishers, subscribers
– Number of groups (topics)
– Churn, failures, loss, perturbances
– High data rates
Reliable
Easy to use, supporting standard APIs
[Figure: QSM vs. JGroups throughput (packets/s) vs. number of nodes, up to 110, with 1 and 2 QSM senders.]
[Figure: QSM throughput at 110 nodes and QSM memory usage (megabytes) vs. number of groups (2 to 8K), compared with JGroups at 2 to 110 nodes.]
[Diagram: each node signs up to 100 groups drawn from A1..A100, B1..B100, and C1..C100 (300 groups in all), sending messages in multiple groups.]
Quicksilver: Key ideas
Design dissemination, reliability, security, and virtual synchrony as concurrently active “stacks”
No need to relay multicasts through any form of centralized service…
Send a typical message with a single (or a few) IP or overlay multicasts
Meta-protocols aggregate work across groups for efficiency (a sketch of the idea follows)
The system is extensively optimized to maximize throughput in all respects
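One way to picture the aggregation idea (an assumed mechanism sketched for illustration, not QSM’s actual meta-protocol): acknowledgments for many groups that are headed to the same peer can be batched into a single control packet, so per-group control cost shrinks as groups are added.

```csharp
using System;
using System.Collections.Generic;

// Illustrative ACK aggregation across groups: rather than one control
// protocol instance per group, ACKs bound for the same peer are batched
// into a single control packet. Names and structure are invented; the
// real QSM meta-protocol differs.
class AckAggregator
{
    // peer -> (group -> highest contiguously received sequence number)
    readonly Dictionary<string, Dictionary<string, long>> pending =
        new Dictionary<string, Dictionary<string, long>>();

    public void RecordAck(string peer, string group, long seqNo)
    {
        if (!pending.TryGetValue(peer, out var perGroup))
            pending[peer] = perGroup = new Dictionary<string, long>();
        perGroup[group] = Math.Max(
            perGroup.TryGetValue(group, out var prev) ? prev : -1, seqNo);
    }

    // Called on a timer: one packet per peer carries ACKs for all groups.
    public void Flush(Action<string, Dictionary<string, long>> sendControlPacket)
    {
        foreach (var kv in pending)
            sendControlPacket(kv.Key, kv.Value);
        pending.Clear();
    }

    static void Main()
    {
        var agg = new AckAggregator();
        agg.RecordAck("node7", "A", 41);
        agg.RecordAck("node7", "B", 17);
        agg.Flush((peer, acks) =>
            Console.WriteLine($"to {peer}: {acks.Count} group ACKs in one packet"));
    }
}
```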
Quicksilver is a work in progress…
The basic scalable infrastructure is working today (coded in C#, runs on .NET)
We’re currently adding:
– Clean integration into .NET, to make groups as easy to use as “files” are today
– Scalable virtual synchrony protocols; may also offer Paxos for those who want the stronger model
– A comprehensive, scalable security architecture
Planning a series of free releases from Cornell
Conclusions?
Enablers for a revolution at the edge
– Groups that look like a natural part of .NET and web services
– Incredibly easy to use… much like shared files
– They scale well enough that they can actually be used in the ways you want… and offer powerful security and consistency guarantees for applications that need them
– Will also integrate with a persistency service to capture the history associated with a group, if desired. Like a transactional log… but much faster and more flexible!
Shared groups with strong properties enable a new generation of trustworthy applications
Extra Slides (provided by Krzys)
What to read? Our OSDI submission:
• QuickSilver Scalable Multicast. Krzysztof Ostrowski, Ken Birman, and Amar Phanishayee.
• On www.cs.cornell.edu/Projects/Quicksilver
• QSM itself is available for download today.
Non-Goals
We don’t aim at “real-time” guarantees
– We always try to deliver messages, even if late
We can sacrifice latency for throughput
– An unavoidable trade-off: buffering and scheduling (at high rates, systems aren’t idle)
– But we’re still in the 10-30 ms range for 1K messages
We don’t do pub-sub filtering
– We provide multicast. Cayuga will do the filtering.
[Diagram (repeated from earlier): each node signs up to 100 groups drawn from A1..A100, B1..B100, and C1..C100 (300 groups in all), sending messages in multiple groups.]
Why multiple groups?
Groups are out there in the wild…
– Replication
– Load balancing
– Events
– Shared data
– Server groups
…lots of groups
Why multiple groups?
Groups are easy to think of… why not use them like we would use files?
– A separate group for each event, category of data items, user request, stock, category of products, or type of service
– May lead to new, easier ways of programming
Limitations of existing approaches
Existing protocols aren’t enough
– Designed to scale in one dimension at a time
  Overheads, bottlenecks
– Costly to run (typically CPU-bound)
– Example: JGroups
  Popular, and considered a solid platform. Part of JBoss.
  Runs in a managed environment (Java)
Limitations of existing approaches
[Figure: JGroups throughput (packets/s) vs. number of nodes, up to 110; 1 sender, 1 group, sending as fast as possible, on a cluster of PIII 1.3 GHz nodes with 512 MB RAM and 100 Mbps Ethernet.]
Limitations of existing approaches
[Figure: JGroups throughput (packets/s) vs. number of groups (2-512) at 2 to 110 nodes; 1 sender, sending as fast as possible, all groups have exactly the same members.]
Limitations of existing approaches
[Diagram: several per-group protocol instances running on each node, with the nodes grouped into a region.]
Lightweight groups: overloaded agents, wasted bandwidth, filtering on receive, extra network hops
Protocol per group: ACK/NAK overload
QSM was tested on 110 nodes
[Figure: QSM throughput (packets/s) vs. number of nodes, up to 110, with 1 and 2 senders, compared with JGroups; 1 group, sending as fast as possible, rates set manually.]
QSM was tested on 8192 groups
[Figure: QSM throughput at 110 nodes and QSM memory usage (megabytes) vs. number of groups (2-8192), compared with JGroups at 2 to 110 nodes; 1 sender, groups perfectly overlap.]
QSM is very cheap to run
[Figure: sender and receiver CPU utilization (%) vs. rate (1K-10K messages/s); 1 group, 110 nodes, 1-2 senders.]
QSM is very cheap to run...
[Figure: sender and receiver CPU utilization (%) vs. message size (16 bytes-64K); 1 sender, 1 group, 110 nodes, maximum rate.]
QSM has an acceptable latency...
[Figure: latency (ms) vs. rate (1K-10K packets/s) for 16-byte and 1000-byte messages.]
...yet sometimes it needs tuning
[Figure: latency (ms) vs. message size (16 bytes-64K). The default buffering settings lead to higher latencies for large messages; lower latencies are achievable by manually tuning those settings.]
QSM tolerates bursty packet loss
[Figure: receive latency, recovery latency, and throughput (normal vs. perturbed) vs. number of nodes; 1 sender, 110 nodes. Once every 10 s, a selected receiver drops every incoming packet, data and control alike, for 1 s, then returns to normal for the remaining 9 s.]
QSM tolerates bursty packet loss
[Figure: receive latency, recovery latency, and perturbed throughput vs. duration of the loss (0.01-1 s); 1 sender, 110 nodes, loss occurs every 10 s as before, but its duration varies.]
QSM tolerates node crashes...
[Figure: number of packets sent, received, and cleaned up over time (325-365 s). When a node crashes, the sender pauses and resumes once the failure is detected, after 10 seconds.]
...and much worse scenarios
[Figure: packets sent, received (regular and perturbed), and cleaned up over time (885-935 s). Worst-case scenario: a node freezes for 10 s in the middle of the run, then resumes, triggering a substantial amount of loss recovery.]
Cumulative effect of perturbances
[Figure: cumulative delay (s) vs. number of nodes for crash and freeze perturbances, i.e. how much extra time is needed to send the same N messages as a result of the perturbance.]
QSM doesn’t collapse but might oscillate
[Figure: throughput (packets/s) over time (250-850 s) with 2 senders and 110 nodes trying to send at a rate that exceeds the maximum achievable; throughput oscillates rather than collapsing.]
Key Insight: Regions
G(x) = the set of groups node x is a member of; x and y are in the same region if G(x) = G(y) (see the sketch below)
Interest sharing
– Receiving the same messages
Fate sharing
– Experiencing the same load and burstiness
– Experiencing the same losses
– Being similarly affected by churn, crashes, etc.
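Computing the regions is simple set bookkeeping: canonicalize each node’s membership set and group together the nodes whose sets match. A small C# sketch, with invented node and group names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Region computation: G(x) is the set of groups node x belongs to, and
// nodes with identical G(x) fall into the same region. Here a region key
// is simply the sorted, comma-joined list of group names. Node and group
// names are invented for the example.
class Regions
{
    static void Main()
    {
        var membership = new Dictionary<string, string[]>
        {
            ["x1"] = new[] { "A", "B" },
            ["x2"] = new[] { "B", "A" },       // same region as x1
            ["x3"] = new[] { "A", "B", "C" },
            ["x4"] = new[] { "C" },
        };

        var regions = membership.GroupBy(
            kv => string.Join(",", kv.Value.OrderBy(g => g, StringComparer.Ordinal)),
            kv => kv.Key);

        foreach (var region in regions)
            Console.WriteLine(
                $"region {{{region.Key}}}: nodes {string.Join(",", region)}");
        // region {A,B}: nodes x1,x2
        // region {A,B,C}: nodes x3
        // region {C}: nodes x4
    }
}
```

Nodes x1 and x2 end up in the same region and hence share both interest and fate.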
Key Insights: Regions
[Diagram: nodes subscribing to combinations of groups A, B, and C; nodes with the same combination (A, B, C, AB, AC, BC, ABC) form a region.]
Key Insights: Regions
[Diagram: application-level “send in A / B / C” requests from group senders are mapped onto region senders covering the regions A, AB, AC, ABC, B, BC, and C.]
Key Insights: Regions
[Diagram: an intra-region protocol runs among the nodes of each region; an inter-region protocol ties the regions together.]
Key Insights: Internals
[Diagram: the QSM kernel. A single core thread services an I/O queue and an alarm queue; applications register feeds that the kernel pulls from; socket begin-send / begin-receive completions and other upcalls and downcalls are enqueued (nonblocking) and dequeued by the core thread (blocking, timed).]
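A hedged reading of this diagram as code: one core thread drains the I/O event queue and fires due alarms, so protocol logic runs lock-free on a single thread. The details below are invented for illustration and are not QSM’s actual kernel.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Illustrative "single core thread": drains an I/O event queue (fed by
// nonblocking socket completions) and fires due alarms from a timer
// queue, so protocol code never needs locks. Not QSM's actual kernel.
class CoreThread
{
    readonly BlockingCollection<Action> ioQueue = new BlockingCollection<Action>();
    readonly List<(DateTime due, Action fire)> alarmQueue =
        new List<(DateTime due, Action fire)>();

    // Called from socket callbacks and other threads (thread-safe queue).
    public void Post(Action ioEvent) => ioQueue.Add(ioEvent);

    // For simplicity, only call Schedule from the core thread itself.
    public void Schedule(TimeSpan delay, Action alarm) =>
        alarmQueue.Add((DateTime.UtcNow + delay, alarm));

    public void Run()
    {
        while (true)
        {
            // Fire every alarm that has come due.
            var now = DateTime.UtcNow;
            foreach (var alarm in alarmQueue.Where(a => a.due <= now).ToList())
            {
                alarmQueue.Remove(alarm);
                alarm.fire();
            }

            // Timed dequeue: sleep at most until the next alarm,
            // waking early if an I/O event arrives.
            var wait = alarmQueue.Count == 0
                ? TimeSpan.FromMilliseconds(50)
                : alarmQueue.Min(a => a.due) - DateTime.UtcNow;
            if (wait < TimeSpan.Zero) wait = TimeSpan.Zero;
            if (ioQueue.TryTake(out var ioEvent, wait))
                ioEvent();   // run the upcall on the core thread
        }
    }
}
```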
As of today...
Already available:
– Multicast scalable in multiple dimensions
– Simple reliability model (keep trying until ACK’ed; see the sketch below)
– Simple messaging API
Work in progress:
– Extending with strong reliability
– Support for the WS-* APIs, “typed endpoints”
– Request-reply communication
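A minimal sketch of the “keep trying until ACK’ed” model, assuming per-packet sequence numbers and a known receiver set; the names are invented, and QSM’s real recovery protocol is regional and more elaborate.

```csharp
using System;
using System.Collections.Generic;

// Sketch of "keep trying until ACK'ed": the sender tracks, per packet,
// which receivers have not yet acknowledged it, and a periodic timer
// retransmits whatever is still outstanding. Illustrative names only.
class RetryUntilAcked
{
    // sequence number -> receivers still missing an ACK
    readonly Dictionary<long, HashSet<string>> unacked =
        new Dictionary<long, HashSet<string>>();

    public void Sent(long seqNo, IEnumerable<string> receivers) =>
        unacked[seqNo] = new HashSet<string>(receivers);

    public void Acked(long seqNo, string receiver)
    {
        if (unacked.TryGetValue(seqNo, out var waiting))
        {
            waiting.Remove(receiver);
            if (waiting.Count == 0)
                unacked.Remove(seqNo);   // stable everywhere: can clean up
        }
    }

    // Retransmission timer: resend anything still missing ACKs.
    public void OnTimer(Action<long, IReadOnlyCollection<string>> resendTo)
    {
        foreach (var kv in unacked)
            resendTo(kv.Key, kv.Value);
    }
}
```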
Deployment scenarios
An architecture with a local “daemon”
– Supports multiple processes more smoothly
– Supports WS-BrokeredNotification
– Supports WS-Eventing
– Supports non-.NET applications by linking with a separate thin library
  Only needs to talk to the local “daemon”
  No need to implement any part of the QSM protocol
  Small; might be written in any language! (see the sketch below)
– Currently in progress…
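A sketch of how thin such a library could be, assuming a line-based protocol to a daemon on a local TCP port; both the port 9432 and the SUB/PUB wire format are invented for illustration, since the thin library is still in progress.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

// Illustrative "thin library": the application implements no part of the
// QSM protocol; it just forwards subscribe/publish requests to the local
// daemon over a local socket. Port and wire format are invented.
class ThinQsmClient : IDisposable
{
    readonly TcpClient daemon = new TcpClient("127.0.0.1", 9432); // hypothetical daemon port
    readonly StreamWriter writer;

    public ThinQsmClient()
    {
        writer = new StreamWriter(daemon.GetStream(), Encoding.UTF8) { AutoFlush = true };
    }

    public void Subscribe(string group) => writer.WriteLine($"SUB {group}");
    public void Publish(string group, string message) => writer.WriteLine($"PUB {group} {message}");

    public void Dispose()
    {
        writer.Dispose();
        daemon.Dispose();
    }
}
```

Because the library only speaks this small wire protocol, an equivalent client could be written in any language, which is the point of the daemon-based deployment.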
Deployment scenarios
[Diagram: Node 1 and Node 2 each run a QSM service (“daemon”); application and web-service processes either link against the QSM library or talk to the local daemon.]
[Diagram: the QuickSilver eventing architecture: a publisher node, a membership service (GMS) with controller, and subscriber applications reached via per-node QuickSilver daemons.]
Conclusions
QSM currently delivers multicast that is scalable in multiple dimensions, with basic reliability properties
Future:
– We are adding support for the WS-* APIs
– We are extending the robustness and reliability
– Two new dimensions: request-reply and pub-sub modes of communication, and strong typing of groups (e.g. for security)
Publications
QuickSilver Scalable Multicast. Krzysztof Ostrowski, Ken Birman, and Amar Phanishayee.
Extensible Web Services Architecture for Notification in Large-Scale Systems. Krzysztof Ostrowski and Ken Birman.
http://www.cs.cornell.edu/projects/quicksilver/pubs.html