Pushing group communication to the edge will enable radically new distributed
applications
Ken Birman
Cornell University
First, a premise
An industry driven by disruptive changes
– Incremental advances are fine, but don’t shake things up very much
– Those who quickly seize and fully exploit disruptive technologies thrive, while those who miss even a single major event are left behind
Our challenge as researchers?
– Notice these opportunities first
Ingredients for disruptive change
Set the stage:
– Pent-up opportunity to leverage the legacy app base
– Infrastructure enablers suddenly in place
– Potential for radically new / killer applications
Some sort of core problem to solve
– Often demands out-of-the-box innovation
Deployable in an easily used form
– These days, developers demand integrated tools
Why group communication?
Or more specifically…
– Why groups as opposed, say, to pub-sub?
– Why does the edge of the network represent such a big opportunity? Who would use it, and why?
Can software challenges (finally) be overcome?
How to integrate with existing platforms?
Groups: A common denominator
Not a new idea… dates back 20 years or more
– V system (Stanford) and Isis system (Cornell)
A group is a natural distributed abstraction
– Can represent replicated data or services, shared keys or other consistent state, leadership or other coordination abstractions, shared memory…
– These days we would visualize a group as an object
  Can store a pointer to it in the file system name space
  Each has an associated type (“endpoint management class”)
A process opens as many groups as it likes (like files); a sketch of what such an API might look like follows
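To make the “groups as files” idea concrete, here is a minimal C# sketch of what a group-as-object API might look like. Everything in it (IGroup, Groups.Open, the event names) is invented for illustration rather than taken from QSM, and the toy implementation only loops messages back within one process:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical "group as object" API: a process opens groups by name,
// much as it opens files. IGroup, Groups.Open and the event names are
// invented for illustration; this toy version only loops a message back
// to subscribers within the same process.
public interface IGroup : IDisposable
{
    string Name { get; }
    void Send(string message);             // multicast to all members
    event Action<string> MessageReceived;  // delivery upcall
}

public static class Groups
{
    static readonly Dictionary<string, LocalGroup> openGroups =
        new Dictionary<string, LocalGroup>();

    // A process can open as many groups as it likes, like files.
    public static IGroup Open(string name)
    {
        if (!openGroups.TryGetValue(name, out var g))
            openGroups[name] = g = new LocalGroup(name);
        return g;
    }

    sealed class LocalGroup : IGroup
    {
        public LocalGroup(string name) { Name = name; }
        public string Name { get; }
        public event Action<string> MessageReceived;
        public void Send(string message) => MessageReceived?.Invoke(message);
        public void Dispose() { }
    }
}

class Example
{
    static void Main()
    {
        using (var quotes = Groups.Open("/groups/stocks/IBM"))
        {
            quotes.MessageReceived += q => Console.WriteLine("quote: " + q);
            quotes.Send("IBM 81.25");   // every subscriber's upcall fires
        }
    }
}
```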
A very short history of groups
The V system offered O/S-level groups, but stalled
Isis added “strong semantics” and also many end-user presentations, like pub-sub
– Strong semantics: the group is indistinguishable from a single entity implementing the same abstraction
– Our name for this model: virtual synchrony
  Can be understood as a form of transactional serializability, but with processes, groups, and messages as the components
  Weaker than Paxos, which uses a strong consensus model
Pub-sub emerged as the most popular end-user API
Virtual Synchrony Model
[Figure: timeline of processes p, q, r, s, t. The group passes through views G0={p,q}, G1={p,q,r,s} (r and s request to join and are added, with state transfer), G2={q,r,s} (p crashes), and G3={q,r,s,t} (t requests to join and is added, with state transfer).]
… to date, the only widely adopted model for consistency and fault-tolerance in highly available networked applications
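The view sequence in the figure can be replayed as a tiny simulation: membership events (joins, failures) are applied in one agreed order, and every surviving member installs the same sequence of views. A hedged sketch in C#, with invented types rather than the real Isis/QSM API:

```csharp
using System;
using System.Collections.Generic;

// Replays the view sequence from the figure: membership events are
// applied in one agreed order, and every surviving member installs the
// same sequence of views G0, G1, ... Types are invented for illustration.
class ViewSequence
{
    static readonly SortedSet<string> members =
        new SortedSet<string>(StringComparer.Ordinal) { "p", "q" };
    static int viewId;

    static void Install(string why) =>
        Console.WriteLine($"G{viewId++} = {{{string.Join(",", members)}}}  ({why})");

    static void Main()
    {
        Install("initial view");
        members.Add("r"); members.Add("s");
        Install("r, s added; state transfer to the joiners");
        members.Remove("p");
        Install("p failed");
        members.Add("t");
        Install("t added; state transfer");
    }
}
```

Running it prints exactly the views G0={p,q} through G3={q,r,s,t} shown above.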
Four models side-by-side
Traditional pub-sub as supported in current products
– No guarantees; often scales poorly (1-to-1 TCP connections)
– Versions that use IP multicast are prone to meltdown when stressed
Virtual synchrony
– Fastest and most scalable of the “strongly consistent” models
– Cheats on apparent synchrony whenever it can, reducing the risk of a correlated failure caused by a poison-pill multicast
Paxos
– Closest fit to “consensus” agreement and fault-tolerance properties
– Performance limited by the 2-phase protocol required to achieve this
– Virtual synchrony uses a Paxos-like protocol to track group membership
Transactions
– Most expensive of all: updates touch persistent storage
– Execution model too constraining for high-speed communication apps
Successes… and failures
Some successes:
– New York Stock Exchange, Swiss Exchange, French Air Traffic Control System, US Navy AEGIS, telephony, factory automation, Microsoft Vista clusters, IBM Websphere
– Paxos popular for small fault-tolerant services
Some failures:
– Implementations were often fragile, didn’t scale well, and were poorly integrated with development environments
– Pub-sub users tolerated weak semantics
Bottom line: the market didn’t scale adequately
Why didn’t the market scale?
Mile-high perspective:
– All existing group communication solutions targeted server platforms, e.g. to replicate data in a clustered application
– Pub-sub became the majority solution for sending data (for example, stock trades) from data centers to client platforms
But neither market was ultimately all that large
– The number of server platforms is tiny compared to the number of client systems… and group communication isn’t the whole story – you always needed “more” technology
– Meanwhile, once you license pub-sub to every major trading floor, you exhaust the associated revenue opportunity
Why didn’t either displace the other?
Pub-sub systems were best-effort technologies
– They compete with systems that just make lots of TCP connections and push data…
– Lacking stronger semantics, applications that wanted security or stronger reliability had to build extra end-to-end logic
But group communication systems had scalability issues of their own
– Most platforms focus on processes using a small number of small groups (often just one group of 3-5 members)
– Other positionings involve relaying through a central service
Is there an answer?
Retarget group communication towards the edge of the network!
– Provide it as a direct client-to-client option
– Integrate it tightly into the dominant client computing platforms (.NET, web services)
– Make it scale much better
Now value is tied to the number of client systems, not the number of servers…
Potential roles for group communication at the edge of the net?
Gaming systems and VR immersion
Delivery of streaming media, stock quotes, projected pricing for stocks and bonds…
Replication of security keys and other system-structuring data and management information
Vision: any client can securely produce and/or consume data streams… servers are present, but in a “supporting role”
Recall our list of ingredients…
Pent-up demand: developers have lacked a way to do this for decades…
Technology enabler: Availability of ubiquitous broadband connectivity, high bandwidths
Potential for high-value use in legacy apps and potential for new killer apps
... But can we overcome scalability limits?
Quicksilver: Krzys Ostrowski, Birman, Phanishayee, Dolev
Publish-subscribe eventing and notification
Scalable in many dimensions
– Number of publishers, subscribers
– Number of groups (topics)
– Churn, failures, loss, perturbances
– High data rates
Reliable
Easy to use, supporting standard APIs
[Figure: QSM vs. JGroups throughput (packets/s) vs. number of nodes, up to 110, with 1 and 2 QSM senders.]
[Figure: QSM throughput at 110 nodes and QSM memory usage (megabytes) vs. number of groups (2 to 8K), compared with JGroups at 2 to 110 nodes.]
[Diagram: each node signs up to 100 groups drawn from A1..A100, B1..B100, and C1..C100 (300 groups in all), sending messages in multiple groups.]
Quicksilver: Key ideas
Design dissemination, reliability, security, and virtual synchrony as concurrently active “stacks”
No need to relay multicasts through any form of centralized service…
Send a typical message with a single (or a few) IP or overlay multicasts
Meta-protocols aggregate work across groups for efficiency (a sketch of the idea follows)
The system is extensively optimized to maximize throughput in all respects
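One way to picture the aggregation idea (an assumed mechanism sketched for illustration, not QSM’s actual meta-protocol): acknowledgments for many groups that are headed to the same peer can be batched into a single control packet, so per-group control cost shrinks as groups are added.

```csharp
using System;
using System.Collections.Generic;

// Illustrative ACK aggregation across groups: rather than one control
// protocol instance per group, ACKs bound for the same peer are batched
// into a single control packet. Names and structure are invented; the
// real QSM meta-protocol differs.
class AckAggregator
{
    // peer -> (group -> highest contiguously received sequence number)
    readonly Dictionary<string, Dictionary<string, long>> pending =
        new Dictionary<string, Dictionary<string, long>>();

    public void RecordAck(string peer, string group, long seqNo)
    {
        if (!pending.TryGetValue(peer, out var perGroup))
            pending[peer] = perGroup = new Dictionary<string, long>();
        perGroup[group] = Math.Max(
            perGroup.TryGetValue(group, out var prev) ? prev : -1, seqNo);
    }

    // Called on a timer: one packet per peer carries ACKs for all groups.
    public void Flush(Action<string, Dictionary<string, long>> sendControlPacket)
    {
        foreach (var kv in pending)
            sendControlPacket(kv.Key, kv.Value);
        pending.Clear();
    }

    static void Main()
    {
        var agg = new AckAggregator();
        agg.RecordAck("node7", "A", 41);
        agg.RecordAck("node7", "B", 17);
        agg.Flush((peer, acks) =>
            Console.WriteLine($"to {peer}: {acks.Count} group ACKs in one packet"));
    }
}
```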
Quicksilver is a work in progress…
The basic scalable infrastructure is working today (coded in C#, runs on .NET)
We’re currently adding:
– Clean integration into .NET, to make groups as easy to use as “files” are today
– Scalable virtual synchrony protocols; may also offer Paxos for those who want the stronger model
– A comprehensive, scalable security architecture
Planning a series of free releases from Cornell
Conclusions?
Enablers for a revolution at the edge
– Groups that look like a natural part of .NET and web services
– Incredibly easy to use… much like shared files
– They scale well enough that they can actually be used in the ways you want… and offer powerful security and consistency guarantees for applications that need them
– Will also integrate with a persistency service to capture the history associated with a group, if desired. Like a transactional log… but much faster and more flexible!
Shared groups with strong properties enable a new generation of trustworthy applications
Extra Slides (provided by Krzys)
What to read? Our OSDI submission:
• QuickSilver Scalable Multicast. Krzysztof Ostrowski, Ken Birman, and Amar Phanishayee.
• On www.cs.cornell.edu/Projects/Quicksilver
• QSM itself is available for download today.
Non-Goals
We don’t aim at “real-time” guarantees
– We always try to deliver messages, even if late
We can sacrifice latency for throughput
– An unavoidable trade-off: buffering and scheduling (at high rates, systems aren’t idle)
– But we’re still in the 10-30 ms range for 1K messages
We don’t do pub-sub filtering
– We provide multicast. Cayuga will do the filtering.
[Diagram (repeated from earlier): each node signs up to 100 groups drawn from A1..A100, B1..B100, and C1..C100 (300 groups in all), sending messages in multiple groups.]
Why multiple groups?
Groups are out there in the wild…
– Replication
– Load balancing
– Events
– Shared data
– Server groups
…lots of groups
Why multiple groups?
Groups are easy to think of… why not use them like we would use files?
– A separate group for each event, category of data items, user request, stock, category of products, or type of service
– May lead to new, easier ways of programming
Limitations of existing approaches
Existing protocols aren’t enough
– Designed to scale in one dimension at a time
  Overheads, bottlenecks
– Costly to run (typically CPU-bound)
– Example: JGroups
  Popular, and considered a solid platform. Part of JBoss.
  Runs in a managed environment (Java)
Limitations of existing approaches
[Figure: JGroups throughput (packets/s) vs. number of nodes, up to 110; 1 sender, 1 group, sending as fast as possible, on a cluster of PIII 1.3 GHz nodes with 512 MB RAM and 100 Mbps Ethernet.]
Limitations of existing approaches
[Figure: JGroups throughput (packets/s) vs. number of groups (2-512) at 2 to 110 nodes; 1 sender, sending as fast as possible, all groups have exactly the same members.]
Limitations of existing approaches
[Diagram: several per-group protocol instances running on each node, with the nodes grouped into a region.]
Lightweight groups: overloaded agents, wasted bandwidth, filtering on receive, extra network hops
Protocol per group: ACK/NAK overload
QSM was tested on 110 nodes
[Figure: QSM throughput (packets/s) vs. number of nodes, up to 110, with 1 and 2 senders, compared with JGroups; 1 group, sending as fast as possible, rates set manually.]
QSM was tested on 8192 groups
[Figure: QSM throughput at 110 nodes and QSM memory usage (megabytes) vs. number of groups (2-8192), compared with JGroups at 2 to 110 nodes; 1 sender, groups perfectly overlap.]
QSM is very cheap to run
[Figure: sender and receiver CPU utilization (%) vs. rate (1K-10K messages/s); 1 group, 110 nodes, 1-2 senders.]
QSM is very cheap to run...
[Figure: sender and receiver CPU utilization (%) vs. message size (16 bytes-64K); 1 sender, 1 group, 110 nodes, maximum rate.]
QSM has an acceptable latency...
[Figure: latency (ms) vs. rate (1K-10K packets/s) for 16-byte and 1000-byte messages.]
...yet sometimes it needs tuning
[Figure: latency (ms) vs. message size (16 bytes-64K). The default buffering settings lead to higher latencies for large messages; lower latencies are achievable by manually tuning those settings.]
QSM tolerates bursty packet loss
[Figure: receive latency, recovery latency, and throughput (normal vs. perturbed) vs. number of nodes; 1 sender, 110 nodes. Once every 10 s, a selected receiver drops every incoming packet, data and control alike, for 1 s, then returns to normal for the remaining 9 s.]
QSM tolerates bursty packet loss
[Figure: receive latency, recovery latency, and perturbed throughput vs. duration of the loss (0.01-1 s); 1 sender, 110 nodes, loss occurs every 10 s as before, but its duration varies.]
QSM tolerates node crashes...
[Figure: number of packets sent, received, and cleaned up over time (325-365 s). When a node crashes, the sender pauses and resumes once the failure is detected, after 10 seconds.]
...and much worse scenarios
[Figure: packets sent, received (regular and perturbed), and cleaned up over time (885-935 s). Worst-case scenario: a node freezes for 10 s in the middle of the run, then resumes, triggering a substantial amount of loss recovery.]
Cumulative effect of perturbances
[Figure: cumulative delay (s) vs. number of nodes for crash and freeze perturbances, i.e. how much extra time is needed to send the same N messages as a result of the perturbance.]
QSM doesn’t collapse but might oscillate
[Figure: throughput (packets/s) over time (250-850 s) with 2 senders and 110 nodes trying to send at a rate that exceeds the maximum achievable; throughput oscillates rather than collapsing.]
Key Insight: Regions
G(x) = the set of groups node x is a member of; x and y are in the same region if G(x) = G(y) (see the sketch below)
Interest sharing
– Receiving the same messages
Fate sharing
– Experiencing the same load and burstiness
– Experiencing the same losses
– Being similarly affected by churn, crashes, etc.
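Computing the regions is simple set bookkeeping: canonicalize each node’s membership set and group together the nodes whose sets match. A small C# sketch, with invented node and group names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Region computation: G(x) is the set of groups node x belongs to, and
// nodes with identical G(x) fall into the same region. Here a region key
// is simply the sorted, comma-joined list of group names. Node and group
// names are invented for the example.
class Regions
{
    static void Main()
    {
        var membership = new Dictionary<string, string[]>
        {
            ["x1"] = new[] { "A", "B" },
            ["x2"] = new[] { "B", "A" },       // same region as x1
            ["x3"] = new[] { "A", "B", "C" },
            ["x4"] = new[] { "C" },
        };

        var regions = membership.GroupBy(
            kv => string.Join(",", kv.Value.OrderBy(g => g, StringComparer.Ordinal)),
            kv => kv.Key);

        foreach (var region in regions)
            Console.WriteLine(
                $"region {{{region.Key}}}: nodes {string.Join(",", region)}");
        // region {A,B}: nodes x1,x2
        // region {A,B,C}: nodes x3
        // region {C}: nodes x4
    }
}
```

Nodes x1 and x2 end up in the same region and hence share both interest and fate.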
Key Insights: Regions
[Diagram: nodes subscribing to combinations of groups A, B, and C; nodes with the same combination (A, B, C, AB, AC, BC, ABC) form a region.]
Key Insights: Regions
[Diagram: application-level “send in A / B / C” requests from group senders are mapped onto region senders covering the regions A, AB, AC, ABC, B, BC, and C.]
Key Insights: Regions
[Diagram: an intra-region protocol runs among the nodes of each region; an inter-region protocol ties the regions together.]
Key Insights: Internals
[Diagram: the QSM kernel. A single core thread services an I/O queue and an alarm queue; applications register feeds that the kernel pulls from; socket begin-send / begin-receive completions and other upcalls and downcalls are enqueued (nonblocking) and dequeued by the core thread (blocking, timed).]
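A hedged reading of this diagram as code: one core thread drains the I/O event queue and fires due alarms, so protocol logic runs lock-free on a single thread. The details below are invented for illustration and are not QSM’s actual kernel.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Illustrative "single core thread": drains an I/O event queue (fed by
// nonblocking socket completions) and fires due alarms from a timer
// queue, so protocol code never needs locks. Not QSM's actual kernel.
class CoreThread
{
    readonly BlockingCollection<Action> ioQueue = new BlockingCollection<Action>();
    readonly List<(DateTime due, Action fire)> alarmQueue =
        new List<(DateTime due, Action fire)>();

    // Called from socket callbacks and other threads (thread-safe queue).
    public void Post(Action ioEvent) => ioQueue.Add(ioEvent);

    // For simplicity, only call Schedule from the core thread itself.
    public void Schedule(TimeSpan delay, Action alarm) =>
        alarmQueue.Add((DateTime.UtcNow + delay, alarm));

    public void Run()
    {
        while (true)
        {
            // Fire every alarm that has come due.
            var now = DateTime.UtcNow;
            foreach (var alarm in alarmQueue.Where(a => a.due <= now).ToList())
            {
                alarmQueue.Remove(alarm);
                alarm.fire();
            }

            // Timed dequeue: sleep at most until the next alarm,
            // waking early if an I/O event arrives.
            var wait = alarmQueue.Count == 0
                ? TimeSpan.FromMilliseconds(50)
                : alarmQueue.Min(a => a.due) - DateTime.UtcNow;
            if (wait < TimeSpan.Zero) wait = TimeSpan.Zero;
            if (ioQueue.TryTake(out var ioEvent, wait))
                ioEvent();   // run the upcall on the core thread
        }
    }
}
```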
As of today...
Already available:
– Multicast scalable in multiple dimensions
– Simple reliability model (keep trying until ACK’ed; see the sketch below)
– Simple messaging API
Work in progress:
– Extending with strong reliability
– Support for the WS-* APIs, “typed endpoints”
– Request-reply communication
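A minimal sketch of the “keep trying until ACK’ed” model, assuming per-packet sequence numbers and a known receiver set; the names are invented, and QSM’s real recovery protocol is regional and more elaborate.

```csharp
using System;
using System.Collections.Generic;

// Sketch of "keep trying until ACK'ed": the sender tracks, per packet,
// which receivers have not yet acknowledged it, and a periodic timer
// retransmits whatever is still outstanding. Illustrative names only.
class RetryUntilAcked
{
    // sequence number -> receivers still missing an ACK
    readonly Dictionary<long, HashSet<string>> unacked =
        new Dictionary<long, HashSet<string>>();

    public void Sent(long seqNo, IEnumerable<string> receivers) =>
        unacked[seqNo] = new HashSet<string>(receivers);

    public void Acked(long seqNo, string receiver)
    {
        if (unacked.TryGetValue(seqNo, out var waiting))
        {
            waiting.Remove(receiver);
            if (waiting.Count == 0)
                unacked.Remove(seqNo);   // stable everywhere: can clean up
        }
    }

    // Retransmission timer: resend anything still missing ACKs.
    public void OnTimer(Action<long, IReadOnlyCollection<string>> resendTo)
    {
        foreach (var kv in unacked)
            resendTo(kv.Key, kv.Value);
    }
}
```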
Deployment scenarios
An architecture with a local “daemon”
– Supports multiple processes more smoothly
– Supports WS-BrokeredNotification
– Supports WS-Eventing
– Supports non-.NET applications by linking with a separate thin library
  Only needs to talk to the local “daemon”
  No need to implement any part of the QSM protocol
  Small; might be written in any language! (see the sketch below)
– Currently in progress…
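A sketch of how thin such a library could be, assuming a line-based protocol to a daemon on a local TCP port; both the port 9432 and the SUB/PUB wire format are invented for illustration, since the thin library is still in progress.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

// Illustrative "thin library": the application implements no part of the
// QSM protocol; it just forwards subscribe/publish requests to the local
// daemon over a local socket. Port and wire format are invented.
class ThinQsmClient : IDisposable
{
    readonly TcpClient daemon = new TcpClient("127.0.0.1", 9432); // hypothetical daemon port
    readonly StreamWriter writer;

    public ThinQsmClient()
    {
        writer = new StreamWriter(daemon.GetStream(), Encoding.UTF8) { AutoFlush = true };
    }

    public void Subscribe(string group) => writer.WriteLine($"SUB {group}");
    public void Publish(string group, string message) => writer.WriteLine($"PUB {group} {message}");

    public void Dispose()
    {
        writer.Dispose();
        daemon.Dispose();
    }
}
```

Because the library only speaks this small wire protocol, an equivalent client could be written in any language, which is the point of the daemon-based deployment.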
Deployment scenarios
[Diagram: Node 1 and Node 2 each run a QSM service (“daemon”); application and web-service processes either link against the QSM library or talk to the local daemon.]
[Diagram: the QuickSilver eventing architecture: a publisher node, a membership service (GMS) with controller, and subscriber applications reached via per-node QuickSilver daemons.]
Conclusions
QSM currently delivers multicast that is scalable in multiple dimensions, with basic reliability properties
Future:
– We are adding support for the WS-* APIs
– We are extending the robustness and reliability
– Two new dimensions: request-reply and pub-sub modes of communication, and strong typing of groups (e.g. for security)
Publications
QuickSilver Scalable Multicast. Krzysztof Ostrowski, Ken Birman, and Amar Phanishayee.
Extensible Web Services Architecture for Notification in Large-Scale Systems. Krzysztof Ostrowski and Ken Birman.
http://www.cs.cornell.edu/projects/quicksilver/pubs.html