
An Overlay Infrastructure for Decentralized Object Location and Routing

Ben Y. Zhao (ravenben@eecs.berkeley.edu)

University of California at Berkeley, Computer Science Division

Peer-based Distributed Computing

Cooperative approach to large-scale applications
o peer-based: available resources scale with the number of participants
o better than client/server: limited resources & scalability

Large-scale, cooperative applications are coming
o content distribution networks (e.g. FastForward)
o large-scale backup / storage utilities: leverage peers' storage for higher resiliency / availability
o cooperative web caching
o application-level multicast: video on-demand, streaming movies

What Are the Technical Challenges?

File system example: replicate files for resiliency/performance
o how do you find nearby replicas?
o how does this scale to millions of users? billions of files?

Node Membership Changes

Nodes join and leave the overlay, or fail
o data or control state needs to know about available resources
o node membership management is a necessity

A Fickle Internet

Internet disconnections are not rare (UMichTR98, IMC02)
o TCP retransmission is not enough; we need to route around failures
o IP route repair takes too long: IS-IS ~5 s, BGP 3-15 mins
o good end-to-end performance requires fast response to faults

[Figure: today each large-scale application (FastForward, Yahoo IM, SETI) separately rebuilds reliable communication, dynamic node membership algorithms, and efficient, scalable data location.]

An Infrastructure Approach

First generation of large-scale apps took a vertical approach: hard problems, difficult to get right
o instead, solve the common challenges once
o build a single overlay infrastructure at the application layer

[Figure: the overlay inserted as a layer into the protocol stack (physical, link, network, transport, session, presentation, overlay, application), running over the Internet.]

Personal Research Roadmap

[Figure: research roadmap, roughly as follows.]
o Tapestry, a DOLR building on PRR 97: robust dynamic algorithms (SPAA 02 / TOCS), structured overlay APIs (IPTPS 03), resilient overlay routing (ICNP 03), WAN deployment with 1500+ downloads (JSAC 04); landmark routing via Brocade (IPTPS 02)
o applications: multicast (Bayeux, NOSSDAV 02), file system (OceanStore, ASPLOS 99 / FAST 03), spam filtering (SpamWatch, Middleware 03), rapid mobility (Warp, IPTPS 04)
o earlier work: service discovery service, XSet lightweight XML DB (Mobicom 99, 5000+ downloads), TSpaces, modeling of non-stationary datasets

Talk Outline
o Motivation
o Decentralized object location and routing
o Resilient routing
o Tapestry deployment performance
o Wrap-up

What should this infrastructure look like?

here is one appealing direction…


Structured Peer-to-Peer Overlays

o node IDs and keys drawn from a randomized namespace (SHA-1)
o incremental routing toward the destination ID
o each node keeps a small set of outgoing routes, e.g. prefix routing
o log(n) neighbors per node, log(n) hops between any node pair

[Figure: prefix routing example, a message for key ABCD is forwarded through nodes matching progressively longer prefixes (e.g. A930, AB5F, ABC0) and delivered at node ABCE.]
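As a concrete illustration of the prefix-matching rule above, here is a minimal Java sketch (illustrative names, not the actual Tapestry API), assuming 160-bit SHA-1 IDs written as hex strings: each hop hands the message to a neighbor that matches at least one more digit of the destination ID than the current node does.

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.*;

    public class PrefixRoutingSketch {
        // Derive a 160-bit ID (40 hex digits) from an arbitrary name, as with SHA-1.
        static String idFor(String name) throws Exception {
            byte[] d = MessageDigest.getInstance("SHA-1").digest(name.getBytes("UTF-8"));
            return String.format("%040x", new BigInteger(1, d));
        }

        // Length of the shared hex-digit prefix between two IDs.
        static int sharedPrefix(String a, String b) {
            int i = 0;
            while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
            return i;
        }

        // Pick a neighbor matching more digits of the destination than we do;
        // resolving one more digit per hop gives O(log n) hops between any node pair.
        static String nextHop(String localId, String destId, List<String> neighbors) {
            int bestLen = sharedPrefix(localId, destId);
            String best = null;
            for (String n : neighbors) {
                int len = sharedPrefix(n, destId);
                if (len > bestLen) { bestLen = len; best = n; }
            }
            return best;  // null: no neighbor is closer, this node is the root for the ID
        }

        public static void main(String[] args) throws Exception {
            String local = idFor("nodeA"), dest = idFor("some-object");
            List<String> neighbors = Arrays.asList(idFor("nodeB"), idFor("nodeC"), idFor("nodeD"));
            System.out.println("next hop: " + nextHop(local, dest, neighbors));
        }
    }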


Related Work

Unstructured peer-to-peer approaches
o Napster, Gnutella, KaZaa
o probabilistic search (optimized for the hay, not the needle)
o locality-agnostic routing (resulting in high network bandwidth costs)

Structured peer-to-peer overlays
o the first protocols (2001): Tapestry, Pastry, Chord, CAN; then: Kademlia, SkipNet, Viceroy, Symphony, Koorde, Ulysseus, …
o distinction: how to choose your neighbors (Tapestry, Pastry: latency-optimized routing mesh)
o distinction: application interface (distributed hash table: put(key, data), data = get(key); Tapestry: decentralized object location and routing)

Defining the Requirements

1. efficient routing to nodes and data
   o low routing stretch (ratio of overlay latency to shortest-path distance)
2. flexible data location
   o applications want/need to control data placement
   o allows for application-specific performance optimizations
   o directory interface: publish(ObjID), RouteToObj(ObjID, msg); a minimal sketch follows this list
3. resilient and responsive to faults
   o more than just retransmission: route around failures
   o reduce the negative impact (loss/jitter) on the application
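To make the interface distinction concrete, here is a minimal Java sketch of what a DOLR-style directory interface might look like next to a DHT-style one; the names are illustrative, not the actual Tapestry or DHT APIs.

    // DOLR: the application keeps the data and announces where it lives.
    public interface DolrSketch {
        // Announce that the local node stores a copy of the object; the overlay
        // installs location pointers along the path toward the object's root.
        void publish(byte[] objectId);

        // Route a message toward the object; it is delivered to a nearby replica
        // rather than to a fixed home node.
        void routeToObject(byte[] objectId, byte[] message);
    }

    // DHT: the overlay decides where the data lives (the node owning hash(key)).
    interface DhtSketch {
        void put(byte[] key, byte[] data);
        byte[] get(byte[] key);
    }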


Decentralized Object Location & Routing

o redirect data traffic using log(n) in-network redirection pointers
o average # of pointers per machine: log(n) × avg files per machine
o keys to performance: a proximity-enabled routing mesh with routing convergence

[Figure: a server publishes object k (publish(k)); clients' routeobj(k) messages travel over the backbone toward k's root and are redirected to the server as soon as they meet one of its pointers.]
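The following Java sketch (illustrative names, a single-process map standing in for the overlay) shows the mechanism the figure describes: publish() leaves a pointer at every hop on the path toward the object's root, and routeToObj() walks its own path toward the root, short-cutting to the replica at the first pointer it meets.

    import java.util.*;

    public class DolrPointersSketch {
        // Per overlay node: objectId -> node holding a replica.
        static Map<String, Map<String, String>> pointers = new HashMap<>();

        // Stand-in for prefix routing: a fixed path from a node toward the object's root.
        // Paths from different nodes converge, which is why the short-cut works.
        static List<String> pathToRoot(String from, String objectId) {
            return Arrays.asList(from, "nodeAB", "nodeABC", "root(" + objectId + ")");
        }

        static void publish(String serverNode, String objectId) {
            for (String hop : pathToRoot(serverNode, objectId)) {
                pointers.computeIfAbsent(hop, k -> new HashMap<>()).put(objectId, serverNode);
            }
        }

        static String routeToObj(String clientNode, String objectId) {
            for (String hop : pathToRoot(clientNode, objectId)) {
                Map<String, String> local = pointers.get(hop);
                if (local != null && local.containsKey(objectId)) {
                    return local.get(objectId);   // redirect here, before reaching the root
                }
            }
            return null;  // not published
        }

        public static void main(String[] args) {
            publish("serverX", "k");
            System.out.println("client routed to: " + routeToObj("clientY", "k"));
        }
    }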


Why Proximity Routing?

Fewer/shorter IP hops: shorter e2e latency, less bandwidth/congestion, less likely to cross broken/lossy links


Performance Impact (Proximity)

Simulated Tapestry with and without proximity on a 5000-node transit-stub network; measured pair-wise routing stretch between 200 random nodes.

Prefix routing with and without proximity (mean routing stretch):

                     In-LAN    In-WAN    Far-WAN
    Ideal (RDP=1)      1         1         1
    Proximity          1.46      1.73      1.79
    Randomized       108.31     15.81      4.46

DOLR vs. Distributed Hash Table

DHT: the hash of the content name determines replica placement; modifications require replicating the new version into the DHT.

DOLR: the application places copies near requests, and the overlay routes messages to them.

Performance Impact (DOLR)

Simulated Tapestry with DOLR and DHT interfaces on a 5000-node transit-stub network; measured route-to-object latency from clients in 2 stub networks. DHT: 5 object replicas; DOLR: 1 replica placed in each stub network.

[Chart: average routing latency vs. overlay size (64 to 4096 nodes), comparing DHT min/avg/max with DOLR.]

Talk Outline
o Motivation
o Decentralized object location and routing
o Resilient and responsive routing
o Tapestry deployment performance
o Wrap-up

How do you get fast responses to faults?

Response time = fault-detection + alternate path discovery + time to switch


Fast Response via Static Resiliency

Reducing fault-detection time
o monitor paths to neighbors with periodic UDP probes
o O(log(n)) neighbors: higher probe frequency at low bandwidth
o exponentially weighted moving average for link quality estimation (see the sketch below)
  avoids route flapping due to short-term loss artifacts
  loss rate: L_n = (1 − α)·L_{n−1} + α·p

Eliminate synchronous backup path discovery
o actively maintain redundant paths, redirect traffic immediately, repair redundancy asynchronously
o create and store backups at node insertion
o restore redundancy via random pair-wise queries after failures

End result: fast detection + precomputed paths = increased responsiveness
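A minimal Java sketch of the EWMA link-quality estimator referenced above (illustrative class name; the probe plumbing is an assumption, not the Tapestry code):

    public class LinkQualityEstimator {
        private final double alpha;     // filter constant, e.g. 0.2 or 0.4
        private double lossRate = 0.0;  // L_n, the smoothed loss estimate

        public LinkQualityEstimator(double alpha) { this.alpha = alpha; }

        // p = instantaneous loss rate observed over the last round of UDP probes (0.0 to 1.0).
        public void update(double p) {
            lossRate = (1 - alpha) * lossRate + alpha * p;   // L_n = (1 - α)·L_{n-1} + α·p
        }

        public double lossRate() { return lossRate; }

        public static void main(String[] args) {
            LinkQualityEstimator est = new LinkQualityEstimator(0.2);
            for (double p : new double[] {0.0, 0.0, 1.0, 0.0}) est.update(p);
            System.out.printf("smoothed loss rate = %.3f%n", est.lossRate());
        }
    }

A larger α reacts faster to fresh losses but is more prone to flapping; a smaller α smooths short-term artifacts at the cost of slower detection, which is the trade-off behind the α = 0.2 and α = 0.4 curves shown later.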


Routing Policies

Use estimated overlay link quality to choose the shortest "usable" link
o use the shortest overlay link whose quality exceeds a threshold T

Alternative policies prioritize low loss over latency
o use the least lossy overlay link
o use the path with minimal "cost function" cf = x·latency + y·loss rate (see the sketch below)


Talk Outline
o Motivation
o Decentralized object location and routing
o Resilient and responsive routing
o Tapestry deployment performance
o Wrap-up

Tapestry, a DOLR Protocol

o routing based on incremental prefix matching
o latency-optimized routing mesh: nearest-neighbor algorithm (HKRZ02) supports massive failures and large group joins
o built-in redundant overlay links: 2 backup links maintained with each primary
o use "objects" as endpoints for rendezvous
  nodes publish names to announce their presence
  e.g. a wireless proxy publishes a nearby laptop's ID
  e.g. multicast listeners publish the multicast session name to self-organize

Weaving a Tapestry

Inserting node 0123 into an existing Tapestry network (a sketch of these steps follows the figure note):
1. route to own ID, find 012X nodes, fill the last routing-table column
2. request backpointers to 01XX nodes
3. measure distance, add to the rTable
4. prune to the nearest K nodes
5. repeat steps 2-4 for each shorter prefix

[Figure: node 0123's routing table levels (XXXX, 0XXX, 01XX, 012X) filled with neighbors such as 1XXX/2XXX/3XXX, 00XX/02XX/03XX, 010X/011X/013X, and 0120/0121/0122.]
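A high-level Java sketch of the insertion steps above (illustrative interface and names; the network calls are placeholders, not the Tapestry wire protocol): fill each routing-table level from candidate nodes and their backpointers, sort by measured distance, and keep only the nearest K.

    import java.util.*;

    public class NodeInsertSketch {
        interface Net {
            List<String> nodesMatchingPrefix(String prefix);          // e.g. "012" -> known 012X nodes
            List<String> backpointersOf(String node, String prefix);  // step 2: who points at this node
            double rtt(String node);                                  // step 3: measured distance
        }

        static Map<Integer, List<String>> buildRoutingTable(String selfId, Net net, int k) {
            Map<Integer, List<String>> rTable = new HashMap<>();
            // Start from the longest matching prefix (route to own ID) and work backwards (step 5).
            for (int level = selfId.length() - 1; level >= 0; level--) {
                String prefix = selfId.substring(0, level);
                List<String> candidates = new ArrayList<>(net.nodesMatchingPrefix(prefix));     // step 1
                for (String n : new ArrayList<>(candidates)) {
                    candidates.addAll(net.backpointersOf(n, prefix));                           // step 2
                }
                candidates.sort(Comparator.comparingDouble(net::rtt));                          // step 3
                rTable.put(level, new ArrayList<>(
                        candidates.subList(0, Math.min(k, candidates.size()))));                // step 4
            }
            return rTable;
        }
    }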


Implementation Performance

Java implementation: 35,000+ lines in core Tapestry, 1500+ downloads

Micro-benchmarks
o per-message overhead ~50 μs, most latency from byte copying
o performance scales with CPU speedup: 5 KB messages on a P-IV 2.4 GHz give throughput of ~10,000 msgs/sec

Routing stretch
o route to node: < 2
o route to objects/endpoints: < 3 (higher stretch for nearby objects)

Responsiveness to Faults (PlanetLab)

Probing bandwidth stays low as the network grows: roughly 7 KB/s per node at N = 300, about 20 KB/s at N = 10^6. Simulation: if the link failure rate is under 10%, Tapestry can route around 90% of survivable failures.

[Chart: time to switch routes (ms) vs. link probe period (ms) on PlanetLab, for EWMA filter constants α = 0.2 and α = 0.4; annotated switch times of roughly 300 ms and 660 ms.]

Stability Under Membership Changes

Routing operations on a 40-node Tapestry cluster; churn: nodes join/leave every 10 seconds, average lifetime = 2 minutes.

[Chart: routing success rate (%) and network size over a 30-minute run that includes killing nodes, a large group join, and constant churn.]

Talk Outline
o Motivation
o Decentralized object location and routing
o Resilient and responsive routing
o Tapestry deployment performance
o Wrap-up

Lessons and Takeaways

Consider system constraints in algorithm design
o limited by finite resources (e.g. file descriptors, bandwidth)
o simplicity wins over small performance gains: easier adoption and faster time to implementation

Wide-area state management (e.g. routing state)
o reactive algorithm for best-effort, fast response
o proactive periodic maintenance for correctness

A naïve event programming model is too low-level
o much code complexity comes from managing stack state
o important for protocols with asynchronous control algorithms
o need explicit thread support for callbacks / stack management

Future Directions

Ongoing work to explore the p2p application space
o resilient anonymous routing, attack resiliency

Intelligent overlay construction
o router-level listeners allow application queries
o efficient meshes, fault-independent backup links, failure notification

Deploying and measuring a lightweight peer-based application
o focus on usability and low overhead
o p2p incentives, security, and deployment meet the real world

A holistic approach to overlay security and control
o p2p is good for self-organization, not for security/management
o decouple administration from normal operation
o explicit domains / hierarchy for configuration, analysis, control

Thanks!

Questions, comments?

ravenben@eecs.berkeley.edu


Impact of Correlated Events

Web / application servers handle independent requests and maximize individual throughput. Correlated requests (A + B + C → D), e.g. online continuous queries, sensor aggregation, a p2p control layer, or streaming data mining, must instead be combined by an event handler.

[Figure: independent requests arriving from the network vs. an event handler combining correlated events A, B, and C into a single result.]

Some Details

Simple fault detection techniques
o periodically probe overlay links to neighbors
o exponentially weighted moving average for link quality estimation
  avoids route flapping due to short-term loss artifacts
  loss rate: L_n = (1 − α)·L_{n−1} + α·p, where p = instantaneous loss rate and α = filter constant
o other techniques are topics of open research

How do we get and repair the backup links?
o each hop has a flexible routing constraint: in prefix routing, the 1st hop only requires 1 fixed digit, so backups are always available until the last hop to the destination
o create and store backups at node insertion
o restore redundancy via random pair-wise queries after failures, e.g. to replace a 123X neighbor, talk to local 12XX neighbors (see the sketch below)
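A small Java sketch of that repair step (illustrative interface, not the Tapestry code): when a 123X backup fails, query the surviving neighbors that share the shorter 12XX prefix for any other 123X nodes they know, and adopt one as the replacement.

    import java.util.*;

    public class BackupRepairSketch {
        interface Peer {
            List<String> neighborsWithPrefix(String prefix);  // remote query, e.g. "123"
        }

        static String findReplacement(String failedId, String neededPrefix,
                                      List<Peer> sameLevelNeighbors) {
            for (Peer p : sameLevelNeighbors) {               // random pair-wise queries
                for (String candidate : p.neighborsWithPrefix(neededPrefix)) {
                    if (!candidate.equals(failedId)) {
                        return candidate;                     // redundancy restored
                    }
                }
            }
            return null;  // no replacement known yet; retry on the next maintenance round
        }
    }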


Route Redundancy (Simulator)

Simulation of Tapestry with 2 backup paths per routing entry; 2 backups give low maintenance overhead and good resiliency.

[Chart: portion of all node pairs reachable vs. proportion of IP links broken (0 to 0.2), comparing instantaneous IP routing with Tapestry / FRLS.]

Another Perspective on Reachability

[Chart: proportion of all paths vs. proportion of IP links broken (0 to 0.2), split into regions: paths where IP and FRLS both route successfully; paths where FRLS finds a route while short-term IP routing fails; paths that exist but that neither IP nor FRLS can locate; and pair-wise paths where no failure-free path remains.]

Single Node Software Architecture

[Figure: applications sit on an application programming interface above the core router, the dynamic Tapestry component, the distance map, and Patchwork, all built on the SEDA event-driven framework inside a Java Virtual Machine, over the network.]

Related Work

Unstructured peer-to-peer applications
o Napster, Gnutella, KaZaa: probabilistic search, difficult to scale, inefficient bandwidth use

Structured peer-to-peer overlays
o Chord, CAN, Pastry, Kademlia, SkipNet, Viceroy, Symphony, Koorde, Coral, Ulysseus, …
o distinctions: routing efficiency, application interface

Resilient routing
o traffic redirection layers: Detour, Resilient Overlay Networks (RON), Internet Indirection Infrastructure (I3)
o our goals: scalability, in-network traffic redirection

Node to Node Routing (PlanetLab)

Ratio of end-to-end latency to ping distance between nodes

All node pairs measured, placed into buckets

[Chart: RDP (min, median, 90th percentile) vs. internode RTT ping time in 5 ms buckets (0 to 300 ms); annotated values: median = 31.5, 90th percentile = 135.]

Object Location (PlanetLab)

Ratio of end-to-end latency to client-object ping distance; local-area stretch improves with additional location state.

[Chart: RDP (min, median, 90th percentile) vs. client-to-object RTT ping time in 1 ms buckets (0 to 200 ms); annotated 90th percentile = 158.]

Micro-benchmark Results (LAN)

[Chart: per-message transmission time (s) vs. message size (0.06 KB to 2048 KB, log scale) on P-III 1 GHz and P-IV 2.4 GHz hosts, with a P-III curve scaled by the 2.3x CPU speedup for comparison.]

Per-message overhead ~50 μs, latency dominated by byte copying; performance scales with CPU speedup; for 5 KB messages, throughput ~10,000 msgs/sec.

[Chart: bandwidth (MB/s) vs. message size (KB) for P-III 1 GHz local, P-IV 2.4 GHz local, and P-IV 2.4 GHz over 100 Mb Ethernet, which is capped by the 100 Mb/s line.]

Traffic Tunneling

Store a mapping from each end host's IP address to its proxy's overlay ID; similar to the approach in Internet Indirection Infrastructure (I3).

[Figure: legacy nodes A and B (A, B are IP addresses) each register with a proxy; the proxies insert put(hash(A), P'(A)) and put(hash(B), P'(B)) into the structured peer-to-peer overlay, and A's proxy performs get(hash(B)) to obtain P'(B) and tunnel A's traffic to B.]
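A minimal Java sketch of that registration/lookup step (illustrative names; an in-memory map stands in for the overlay's put/get):

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.*;

    public class TunnelRegistrySketch {
        static Map<String, String> overlay = new HashMap<>();  // stand-in for the DHT layer

        static String hashKey(String ip) throws Exception {
            byte[] d = MessageDigest.getInstance("SHA-1").digest(ip.getBytes(StandardCharsets.UTF_8));
            return String.format("%040x", new BigInteger(1, d));
        }

        // Proxy P'(B) registers on behalf of legacy node B: put(hash(B), P'(B)).
        static void register(String legacyIp, String proxyOverlayId) throws Exception {
            overlay.put(hashKey(legacyIp), proxyOverlayId);
        }

        // A's proxy looks up where to tunnel traffic destined for B: get(hash(B)).
        static String lookupProxy(String legacyIp) throws Exception {
            return overlay.get(hashKey(legacyIp));
        }

        public static void main(String[] args) throws Exception {
            register("10.0.0.2", "proxyB-overlay-id");
            System.out.println("tunnel B's traffic via: " + lookupProxy("10.0.0.2"));
        }
    }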


Constrained Multicast

Used only when all paths are below the quality threshold
o send duplicate messages on multiple paths, leveraging route convergence
o assign unique message IDs to mark duplicates; keep a moving window of IDs to recognize and drop duplicates (see the sketch below)

Limitations
o assumes the loss is not from congestion
o ideal for local-area routing
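A short Java sketch of the duplicate-suppression window (illustrative class, not the Tapestry implementation): each message carries a unique ID, and a bounded window of recently seen IDs lets a receiver drop the extra copies arriving over redundant paths.

    import java.util.*;

    public class DuplicateFilterSketch {
        private final int windowSize;
        private final Deque<Long> order = new ArrayDeque<>();  // IDs in arrival order
        private final Set<Long> seen = new HashSet<>();        // fast membership test

        public DuplicateFilterSketch(int windowSize) { this.windowSize = windowSize; }

        // Returns true if the message should be delivered, false if it is a duplicate.
        public boolean accept(long messageId) {
            if (seen.contains(messageId)) return false;  // duplicate from another path
            seen.add(messageId);
            order.addLast(messageId);
            if (order.size() > windowSize) {             // slide the window forward
                seen.remove(order.removeFirst());
            }
            return true;
        }

        public static void main(String[] args) {
            DuplicateFilterSketch filter = new DuplicateFilterSketch(1024);
            System.out.println(filter.accept(42L));  // true: first copy delivered
            System.out.println(filter.accept(42L));  // false: duplicate dropped
        }
    }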


Link Probing Bandwidth (PL)

[Chart: probing bandwidth per node (KB/s) vs. overlay size (log scale, up to ~1000 nodes) on PlanetLab, for probe periods of 300 ms and 600 ms.]

Bandwidth increases logarithmically with overlay size; medium-sized routing overlays incur low probing bandwidth.