Www.dvs1.informatik.tu-darmstadt.de BubbleStorm: Resilient, Probabilistic, and Exhaustive...

Post on 29-Jan-2016

217 views 0 download

Transcript of Www.dvs1.informatik.tu-darmstadt.de BubbleStorm: Resilient, Probabilistic, and Exhaustive...

www.dvs1.informatik.tu-darmstadt.de

BubbleStorm: Resilient, Probabilistic, and Exhaustive Peer-to-Peer Search

Wesley W. Terpstra, Jussi Kangasharju, Christof Leng, Alejandro P. Buchmann

Databases and Distributed Systems Group

Technische Universität Darmstadt

Germany

2

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Classification of P2P Search

Centralized

EDonkey2000

Chord

CAN

Kademlia

P-Grid

Tapestry Pastry

Napster

Structured Unstructured

Gnutella

Random Walks Gia

BubbleStorm

FastTrack

4

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Why unstructured search?

5

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Query processing is independent from routing

This simplifies application development:

Implementing a query language locally distributed implementation “for free”

Reuse existing libraries for query languages SQLite, XPath, Lucene, …

No need to invent a new algorithm per query language

Separation of Concerns

6

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Expressive Searches are Easy

Any selective query can be supported

One operation in unstructured systems can perform a full-text search range restriction on file size hierarchical type selection

Structured systems break queries into small pieces e.g. DHTs must transform the query into key-value Cannot simply compare the cost of operations

7

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Example: Full text search

One op. fetches documents May contact more peers

Perform multiple lookups Transfer word lists (not via DHT) Fetch documents

Keyword 1

Keyword 2

Keyword 3

Doc 1

Doc 2

Doc 2

Doc 1

Unstructured Overlay DHT with inverted index

Latency favours the unstructured approach Relative bandwidth requirements highly parameter dependent

8

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Not everything is uniform

Natural load balancing

P2P applications often handle Zipfian loads Human text has =1 YouTube has =0.5

An unstructured request can be served by any peer Heterogeneity is accommodated by irregular degree In comparison, adding a keyspace creates hot

spots

9

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Example: Zipfian Load

For natural languages (e.g. full text search) in a keyspace: Expected load on most the loaded peer is 7000x average The loaded peer probably has only average capacity

1000

10000

100

10

1

0.1 100000 10000

Node Rank

1000 100 101

Load

com

pare

d t

o load

avera

ge

alpha=1.0

alpha=0.5

alpha=0.0

100,000 peers1 million documents

10

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

BubbleStorm Intuition

Replicate both queries and data copies each (hidden constants unequal)

Data and queries rendezvous in the network€

O n( )

11

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

BubbleStorm: Random Replication

Place data replicas on random nodes Nodes evaluate query replicas on all stored data

Where both data and query go, matches are found

Collisions result from the birthday paradoxData

Query

12

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

BubbleStorm: Exploiting Heterogeneity

Peers have different capacities Faster peers receive more traffic

This is beneficial! Contribution is squared

Data

Query

13

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

BubbleStorm Components

Random Topology Allows efficient sampling of peers at random

Topology Measurement Computes network size and statistics

See PODC’07 brief announcement

Bubblecast Replicates queries/data onto peers quickly

Bubble Maintainer Preserves the correct number of replicas

Covered in this talk

14

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Random Multigraph Topology

Random graphs support the birthday paradox Exploring an edge leads to a randomly sampled peer creation of random node subset (bubble) is cheap

Node degree is chosen proportional to bandwidth As random walks (and bubblecasts) follow edges

with equal probabilityUtilization will be balanced for heterogeneity

15

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Random Multigraph Topology

The topology is a random permutation of its edges It is modified only when peers join or leave

Topology Eulerian Cycle

16

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Join Algorithm

Topology Eulerian Cycle

Joining Node

Contact bootstrapping node Random walk finds a random edge Split the edge and insert in between Multiple joins are executed in parallel or

iteratively

17

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Leave Algorithm

Leave splices two neighboring edges together Join and leave do not change degree of neighbors

Topology Eulerian Cycle

Leaving Node

18

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Bubblecast Motivation

Random Walk

- high latency- unreliable+ precise length+ balanced link load

Bubblecast

+ low latency+ reliable+ precise node count+ balanced link load

Flooding

+ low latency+ reliable- imprecise node count- unbalanced link load

node counter (not hops)fixed branch factor

branch in every step

19

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Example Bubblecast Execution

Decrement the counter for matching locally

17

8 8

4 3 4 3

2 1 1 1 2 1 1 1

1 1

- 1

- 1 - 1

- 1 - 1- 1 - 1

- 1 - 1

Split the counter between two neighbors Counters are always integral

Forwarding terminates when counter reaches 0 Final routing depth differs by at most one hop

A counter specifies the number of replicas to create

20

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Bubblecast Properties

Used for query and data replication

Fixed branch factor balances load Same stationary distribution as a random walk

Counter for edges crossed, not hops Precisely controls replica count

Logarithmic routing depth Slightly deeper than flooding

Message loss reduces replication by log(size)

Samples random nodes… due to random topology

21

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Complexity and Correctness

BubbleStorm costs roughly bandwidth / op.to provide exhaustive search with

The full equation ( ) is complicated by Heterogeneous peer capacity (H) Dependent sampling (due to repeated withdrawals) Unequal query and post traffic (; BS optimizes this)Full details in the paper

c n

P(failure) < e−c 2

c 1 2 3 4

P(success) 63.21% 98.17% 99.99% 99.99999%

e−c 2 +c 3HΥ

22

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Heterogeneous System

Simulation parameters 1 million peers with heterogeneous upstream:

60% 16kB/s, 25% 32kB/s,10% 128kB/s, 5% 1.2MB/s

100B query every 5 user minutes (80/20 injection) 2kB meta data stored every 30 user minutes

Exponential lifetime, mean 60 minutes 10% of leaves are crash failures

Target reliability is 98.2% (c=2)

23

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Apples to Apples Performance: Success

0

0.2

0.4

0.6

0.8

1

100 1000 10000 100000 1e+06

Success Probability

Network size (nodes)

BubbleStormRandom Walk

Gnutella

Search success remains unaffected by increasing the network size

24

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Apples to Apples Performance: Latency

0.1

1

10

100

1000

100 1000 10000 100000 1e+06

Latency (s)

Network size (nodes)

RW PostRW QueryRW MatchGnu QueryGnu Match

BS PostBS QueryBS Match

BS = BubbleStormGnu = GnutellaRW = Ferreira P2P’05

Post = Data replicated Query = Query completed

Match = First hit found

25

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Apples to Apples Performance: Bandwidth

0.01

0.1

1

10

100

1000

10000

100 1000 10000 100000 1e+06

Total System Traffic (MB/s)

Network size (nodes)

Uplink CapacityGnutella

TopologyRandom Walk

Bubblecast

26

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Homogeneous Network: Leave 50%

Give all peers 10kB/s upstream (heterogeneity would help) Set 50% to depart (gracefully) after 1 minute

0.95

0.96

0.97

0.98

0.99

1

27:00 29:00 31:00 33:00 35:00

Probability/Fraction reached

Time (mm:ss)

Unique peersSuccess

27

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Homogeneous Network: Crash 50%

Crash 50% of peers after 1 minute Echo effect: posted data is missing

0

0.2

0.4

0.6

0.8

1

27:00 29:00 31:00 33:00 35:00

Probability/Fraction reached

Time (mm:ss)

Unique peersSuccess

28

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

Current and Future Work

Replica preservation in persistent bubbles Sustain bubble sizes under churn Scale bubble sizes with network size / composition

Update content Non-destructive, versioned updates Delete with death certificates

Release implementation

29

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

BubbleStorm Properties in Recap

Unstructured Queries may be in any language

Heterogeneous Exploits peers with varied capacity

Load Balanced Stationary utilization distribution is flat

Resilient Survives 50% crash fail and 90% leave

Exhaustive All matches of an operation can be retrieved

Probabilistic Success is a tunable guarantee

30

DA

TA

BA

SE

S A

ND

DIS

TR

IBU

TE

D S

YS

TE

MS

TE

CH

NIS

CH

E U

NIV

ER

SIT

ÄT

DA

RM

ST

AD

T

SIGCOMM’07: 1. Motivation 2. Overview 3. Topology 4. Bubblecast 5. Evaluation

When does BubbleStorm fit?

Complex query languages Keyword search and beyond

Zipfian load with large Partitioning data will create an all-pairs sub-problem

Mostly static data Allows us to trade post traffic for search traffic

Highly volatile networks Unstructured topology recovers quickly

www.dvs1.informatik.tu-darmstadt.de

?Questions

Thanks for listening!