
05/03/2007 P2P 1


ecs251 Spring 2007:

Operating System Models
#3: Peer-to-Peer Systems

Dr. S. Felix Wu

Computer Science Department

University of California, Davis
http://www.cs.ucdavis.edu/~wu/

[email protected]

05/03/2007 P2P 2

The role of service provider…

Centralized management of services
– DNS, Google, www.cnn.com, Blockbuster, SBC/Sprint/AT&T, cable service, Grid computing, AFS, bank transactions…

Information, computing, & network resources owned by one or very few administrative domains.
– Some with SLA (Service Level Agreement)

05/03/2007 P2P 3

Interacting with the “SP”

Service providers are the owners of the information and the interactions.
– Some enhance/establish the interactions

05/03/2007 P2P 4

Let’s compare …

Google, Blockbuster, CNN, MLB/NBA, LinkedIn, eBay

Skype, BitTorrent, blogs, YouTube, botnets, cyber-paparazzi

05/03/2007 P2P 5

Toward P2P

More participation of the end nodes (or their users)
– More decentralized computing/network resources available
– End-user controllability and interactions
– Security/robustness concerns

05/03/2007 P2P 6

Service Providers in P2P

We might not like SPs, but we still cannot avoid them entirely.
– Who is going to lay the fiber and switches?
– Can we avoid DNS?
– How can we stop “cyber-bullying” and the like?
– Copyright enforcement?
– Does the Internet become a junkyard?

05/03/2007 P2P 7

We will discuss…

P2P system examples
– Unstructured, structured, incentives

Architectural analysis and issues
Future P2P applications, and why

05/03/2007 P2P 8

Challenge to you…

Define a new P2P-related application, service, or architecture.

Justify why it is practical, useful, and will scale well.
– Example: sharing cooking recipes, or experiences & recommendations about restaurants and hotels

05/03/2007 P2P 9

Napster

P2P file sharing
“Unstructured”

05/03/2007 P2P 10

Napster

[Diagram: peers and the Napster server holding the index]
1. File-location request sent to the Napster server
2. Server returns a list of peers offering the file
3. File request sent directly to a peer
4. File delivered peer-to-peer
5. Index update

05/03/2007 P2P 11

Napster

Advantages? Disadvantages?


05/03/2007 P2P 14

Originally conceived of by Justin Frankel, the 21-year-old founder of Nullsoft.
March 2000: Nullsoft posts Gnutella to the web.
A day later, AOL removes Gnutella at the behest of Time Warner.

The Gnutella protocol version 0.4:
http://www9.limewire.com/developer/gnutella_protocol_0.4.pdf
and version 0.6:
http://rfc-gnutella.sourceforge.net/Proposals/Ultrapeer/Ultrapeers.htm

There are multiple open-source implementations at http://sourceforge.net/, including:
– Jtella
– Gnucleus

Software released under the Lesser Gnu Public License (LGPL).
The Gnutella protocol has been widely analyzed.

05/03/2007 P2P 15

Gnutella Protocol Messages

Broadcast messages
– Ping: initiating message (“I’m here”)
– Query: search pattern and TTL (time-to-live)

Back-propagated messages
– Pong: reply to a ping; contains information about the peer
– Query response: contains information about the computer that has the needed file

Node-to-node messages
– GET: return the requested file
– PUSH: push the file to me

05/03/2007 P2P 16

[Diagram: overlay network of nodes 1–7; nodes 5 and 7 hold file A]

Steps:
• Node 2 initiates search for file A

05/03/2007 P2P 17

[Diagram: node 2 sends the query for A to all of its neighbors]

Steps:
• Node 2 initiates search for file A
• Sends message to all neighbors

05/03/2007 P2P 18

[Diagram: neighbors forward the query for A]

Steps:
• Node 2 initiates search for file A
• Sends message to all neighbors
• Neighbors forward message

05/03/2007 P2P 19

[Diagram: nodes 5 and 7 have file A and generate replies A:5 and A:7]

Steps:
• Node 2 initiates search for file A
• Sends message to all neighbors
• Neighbors forward message
• Nodes that have file A initiate a reply message

05/03/2007 P2P 20

[Diagram: the replies A:5 and A:7 are back-propagated along the query path]

Steps:
• Node 2 initiates search for file A
• Sends message to all neighbors
• Neighbors forward message
• Nodes that have file A initiate a reply message
• Query reply message is back-propagated

05/03/2007 P2P 21

[Diagram: the replies continue back toward node 2]

Steps:
• Node 2 initiates search for file A
• Sends message to all neighbors
• Neighbors forward message
• Nodes that have file A initiate a reply message
• Query reply message is back-propagated

05/03/2007 P2P 22

[Diagram: node 2 downloads file A directly from a responding node]

Steps:
• Node 2 initiates search for file A
• Sends message to all neighbors
• Neighbors forward message
• Nodes that have file A initiate a reply message
• Query reply message is back-propagated
• File download

• Note: file transfer between clients behind firewalls is not possible; if only one client, X, is behind a firewall, Y can request that X push the file to Y

Limited-scope flooding; reverse-path forwarding
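The mechanics walked through above fit in a few lines of code. Below is a minimal, illustrative sketch in Python (not the actual Gnutella wire protocol; the Node class and message fields are assumptions for the example), combining limited-scope flooding via a TTL, duplicate suppression by message GUID, and reverse-path forwarding of replies:

```python
import uuid

class Node:
    """Toy Gnutella-style peer: floods queries, back-propagates replies."""
    def __init__(self, name, files=()):
        self.name = name
        self.files = set(files)
        self.neighbors = []        # direct overlay links
        self.seen = set()          # GUIDs already handled (duplicate suppression)
        self.reverse_path = {}     # GUID -> neighbor the query arrived from

    def search(self, pattern, ttl=7):            # TTL = 7, the common default
        guid = uuid.uuid4().hex
        self.seen.add(guid)
        for nb in self.neighbors:
            nb.on_query(guid, pattern, ttl - 1, came_from=self)

    def on_query(self, guid, pattern, ttl, came_from):
        if guid in self.seen:                    # drop duplicate copies
            return
        self.seen.add(guid)
        self.reverse_path[guid] = came_from      # remember the return direction
        if pattern in self.files:                # hit: reply travels backwards
            came_from.on_reply(guid, hit=self.name)
        if ttl > 0:                              # limited scope: stop at TTL 0
            for nb in self.neighbors:
                if nb is not came_from:
                    nb.on_query(guid, pattern, ttl - 1, came_from=self)

    def on_reply(self, guid, hit):
        prev = self.reverse_path.get(guid)
        if prev is None:                         # we originated the query
            print(f"node {self.name}: file found at node {hit}")
        else:
            prev.on_reply(guid, hit)             # reverse-path forwarding
```

Wiring up the seven-node topology from the slides and calling node 2’s `search("A")` would print hits at nodes 5 and 7.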

05/03/2007 P2P 23

Gnutella

Advantages? Disadvantages?

05/03/2007 P2P 24

GUID: short for Globally Unique Identifier, a randomized string that is used to uniquely identify a host or message on the Gnutella network. This prevents duplicate messages from being sent on the network.

GWebCache: a distributed system for helping servants connect to the Gnutella network, thus solving the "bootstrapping" problem. Servants query any of several hundred GWebCache servers to find the addresses of other servants. GWebCache servers are typically web servers running a special module.

Host Catcher: Pong responses allow servants to keep track of active Gnutella hosts.

On most servants, the default port for Gnutella is 6346

05/03/2007 P2P 25

Gnutella Network Growth

[Chart: number of nodes in the largest network component (in thousands, axis 0–50), sampled on dates from 11/20/00 through 05/29/01]

05/03/2007 P2P 26

“Limited Scope Flooding”

Ripeanu reported that Gnutella traffic totals 1 Gbps (or 330 TB/month).
– Compare to 15,000 TB/month in the US Internet backbone (December 2000)
– This estimate excludes actual file transfers

Reasoning:
– QUERY and PING messages are flooded; they form more than 90% of generated traffic
– Predominant TTL = 7; >95% of nodes are less than 7 hops away
– Measured traffic at each link: about 6 kbps
– Network with 50k nodes and 170k links
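As a sanity check, the slide’s own per-link measurement and link count reproduce the headline figures (a back-of-the-envelope sketch using the slide’s numbers):

```python
links = 170_000          # measured overlay links
per_link_bps = 6_000     # ~6 kbps of query/ping traffic per link

total_bps = links * per_link_bps
print(total_bps / 1e9)                         # ~1.0 Gbps, the 1 Gbps estimate

month_seconds = 30 * 24 * 3600
print(total_bps / 8 * month_seconds / 1e12)    # ~330 TB/month
```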

05/03/2007 P2P 27

[Diagram: overlay links among nodes A–H that align with the underlying physical topology]

Perfect Mapping

05/03/2007 P2P 28

[Diagram: the same nodes A–H with overlay links that ignore the physical topology]

Inefficient mapping
Link D–E needs to support six times higher traffic.

05/03/2007 P2P 29

Topology mismatch

The overlay network topology doesn’t match the underlying Internet infrastructure topology!

– 40% of all nodes are in the 10 largest Autonomous Systems (ASes)
– Only 2–4% of all TCP connections link nodes within the same AS
– Largely “random wiring”
– Most Gnutella-generated traffic crosses AS borders, making the traffic more expensive
– May cause ISPs to change their pricing schemes

05/03/2007 P2P 30

Scalability

Whenever a node receives a message (ping/query), it sends copies out to all of its other connections.

Existing mechanisms to reduce traffic:
– TTL counter
– Caching information about received messages, so that nodes don’t forward duplicate messages

05/03/2007 P2P 31

– 70% of Gnutella users share no files
– 90% of users answer no queries
– Those who have files to share may limit the number of connections or the upload speed, resulting in a high download failure rate.
– If only a few individuals contribute to the public good, these few peers effectively act as centralized servers.

05/03/2007 P2P 32

Anonymity

Gnutella provides for anonymity by masking the identity of the peer that generated a query.

However, IP addresses are revealed at various points in its operation: HITS packets include the URL for each file, revealing the IP addresses.

05/03/2007 P2P 33

Query Expressiveness

– Query format is not standardized: there is no standard format or matching semantics for the QUERY string. Its interpretation is completely determined by each node that receives it.
– String literal vs. regular expression
– Directory name, filename, or file contents
– Malicious users may even return files unrelated to the query

05/03/2007 P2P 34

Superpeers

Cooperative, long-lived peers, typically with significant resources, that handle a very high volume of query-resolution traffic.


05/03/2007 P2P 36

– Gnutella is a self-organizing, large-scale P2P application that produces an overlay network on top of the Internet; it appears to work.
– Growth is hindered by the volume of generated traffic and inefficient resource use.
– Since there is no central authority, the open source community must commit to making any changes.
– Suggested changes have been proposed in:
  – “Peer-to-Peer Architecture Case Study: Gnutella Network,” by Matei Ripeanu
  – “Improving Gnutella Protocol: Protocol Analysis and Research Proposals,” by Igor Ivkovic

05/03/2007 P2P 37

Freenet

Essentially the same as Gnutella:
– Limited-scope flooding
– Reverse-path forwarding

Difference:
– Data objects (i.e., files) are also delivered via “reverse-path forwarding”

05/03/2007 P2P 38

P2P Issues

– Scalability & load balancing
– Anonymity
– Fairness, incentives & trust
– Security and robustness
– Efficiency
– Mobility

05/03/2007 P2P 39

Incentive-driven Fairness

P2P means we all should contribute…
– Hopefully fairly, but the majority is selfish…

“Incentive for people to contribute…”

05/03/2007 P2P 40

BitTorrent: “Tit for Tat”

Equivalent retaliation (game theory)
– A peer will “initially” cooperate, then respond in kind to an opponent’s previous action. If the opponent previously was cooperative, the agent is cooperative. If not, the agent is not.

05/03/2007 P2P 41

BitTorrent

Fairness of download and upload between a pair of peers.
Every 10 seconds, estimate the download bandwidth from the other peer.
– Based on this performance estimate, decide whether to continue uploading to the other peer.

05/03/2007 P2P 42

Client & its Peers

Client
– Download rate (from the peers)

Peers
– Upload rate (to the client)

05/03/2007 P2P 43

BT Choking by Client

By default, every peer is “choked”
– Stop “uploading” to them, but the TCP connection is still there.

Select four peers to “unchoke”
– Best “upload rates” and “interested”
– Upload to the unchoked ones and monitor the download rate for all the peers
– “Re-choke” every 30 seconds (a sketch of this loop follows)

Optimistic unchoking
– Randomly select a choked peer to unchoke
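A minimal sketch of one re-choke round (not the actual BitTorrent client code; the peer attributes and the split into three rate-based slots plus one optimistic slot are assumptions for illustration):

```python
import random

def rechoke(peers, slots=4):
    """One choke round, run every ~30 seconds.

    Each peer object is assumed to carry:
      rate_to_us  - measured upload rate from that peer to us
      interested  - whether the peer wants pieces we have
      choked      - our choke state for that peer (True = not uploading)
    """
    # Tit for tat: prefer interested peers that upload to us the fastest.
    by_rate = sorted((p for p in peers if p.interested),
                     key=lambda p: p.rate_to_us, reverse=True)
    unchoked = set(by_rate[:slots - 1])

    # Optimistic unchoke: one random other peer gets a chance to prove itself.
    others = [p for p in peers if p not in unchoked]
    if others:
        unchoked.add(random.choice(others))

    for p in peers:
        p.choked = p not in unchoked
```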

05/03/2007 P2P 44

“Interested”

A request for a piece (or its sub-pieces)

05/03/2007 P2P 45

Becoming a “seed”

Use the “upload” rate to the peers to decide which peers to unchoke.

05/03/2007 P2P 46

BitTorrent Wiki

05/03/2007 P2P 47

BT Peer Selection

From the “Tracker”
– We receive a partial list of all active peers for the same file
– We can get another 50 from the tracker if we want

05/03/2007 P2P 48

Piece Selection

Piece (64K~1M); sub-piece (16K)
– Piece size: a trade-off between performance and the size of the torrent file itself
– A client might request different sub-pieces of the same piece from different peers.

Strict priority: finish the sub-pieces of a partially received piece before starting a new piece.
Rarest first
– Exception: “random first”
– Get the content out of the seed(s) as soon as possible…

05/03/2007 P2P 49

Rarest First

Exchanging bitmaps with 20+ peers
– Initial messages
– “have” messages

Array of buckets (see the sketch below)
– The i-th bucket contains pieces with i known instances
– Within the same bucket, the client randomly selects one piece.
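A small sketch of the bucket idea (illustrative; `peer_bitmaps` and `have` are assumed inputs built from the initial bitmaps and “have” messages):

```python
import random
from collections import Counter

def pick_rarest(peer_bitmaps, have):
    """Return a needed piece with the fewest known instances,
    chosen randomly within the rarest bucket."""
    counts = Counter()
    for bitmap in peer_bitmaps:
        counts.update(bitmap - have)        # only count pieces we still need
    if not counts:
        return None
    rarest = min(counts.values())           # the lowest-population bucket
    bucket = [piece for piece, c in counts.items() if c == rarest]
    return random.choice(bucket)

# Piece 3 is advertised by only one peer, so it is selected.
print(pick_rarest([{1, 2}, {1, 2, 3}, {2}], have={2}))   # -> 3
```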

05/03/2007 P2P 50

Random First

Rarest pieces are, by definition, held by one or very few peers, so the client would have to get all of their sub-pieces from those few peers.
For the first 4~5 pieces, download some random pieces instead, so the client quickly has a few pieces to upload.

05/03/2007 P2P 51

BitTorrent

– Connect to the tracker
– Connect to 20+ peers
– Random-first or rarest-first piece selection
– Monitor the download rate from the peers (i.e., the upload rate to the client)
– Unchoke and optimistic unchoke

05/03/2007 P2P 52

BitTorrent

Advantages? Disadvantages?

05/03/2007 P2P 53

Trackerless BitTorrent

Every BT peer is a tracker!
But how would they share and exchange information regarding other peers?
– Similar to Napster’s index server or DNS

05/03/2007 P2P 54

Pure P2P

– Every peer is a tracker
– Every peer is a DNS server
– Every peer is a Napster index server

How can this be done?
– We try to remove/reduce the role of “special servers”!

05/03/2007 P2P 55

Peer

What are the requirements of a peer?

05/03/2007 P2P 56

Structured Peering

Peer identity and routability

05/03/2007 P2P 57

Structured Peering

Peer identity and routability
Key/content assignment

– Which identity owns what? (Google Search?)

05/03/2007 P2P 58

Structured Peering

Peer identity and routability
Key/content assignment
– Which identity owns what?
  – Napster: centralized index service
  – Skype/Kazaa: login server & super peers
  – DNS: hierarchical DNS servers

Two problems:
(1) How to connect to the “ring”?
(2) How to prevent failures/changes?

05/03/2007 P2P 59

DHT

Distributed hash tables (DHTs)
– A decentralized lookup service with a hash-table interface
– (name, value) pairs are stored in the DHT
– Any peer can efficiently retrieve the value associated with a given name
– The mapping from names to values is distributed among peers

05/03/2007 P2P 60

HT as a search table

Index key

Information/content is distributed, and we need to know where.
– Where is this piece of music?
– What is the location of this type of content?
– What is the current IP address of this Skype user?

05/03/2007 P2P 61

DHT as a search table

Index key

???


05/03/2007 P2P 64

DHT

– Scalable
– Peer arrivals, departures, and failures
– Unstructured versus structured

05/03/2007 P2P 65

DHT (Name, Value)

How can we utilize a DHT to avoid trackers in BitTorrent?

05/03/2007 P2P 66

DHT-based Tracker

Index key: whoever owns this hash entry is the tracker for the corresponding key!

[Diagram: the torrent name (e.g., “FreeBSD 5.4 CD images”) is hashed to an index key; the value stored under that key is the seed’s IP address]
– Publish the key on the class web site.
– Operations: PUT & GET
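A toy sketch of the idea (the `dht_put`/`dht_get` functions and the addresses are invented stand-ins; in a real trackerless client the table is spread across peers via a Kademlia-style DHT rather than a local dict):

```python
import hashlib

_dht = {}   # stand-in for the distributed table

def dht_put(key, value):
    _dht.setdefault(key, set()).add(value)

def dht_get(key):
    return _dht.get(key, set())

def torrent_key(name):
    # Hash the torrent name to a 160-bit key; whoever owns this hash
    # entry on the ring acts as the tracker for the torrent.
    return hashlib.sha1(name.encode()).hexdigest()

key = torrent_key("FreeBSD 5.4 CD images")
dht_put(key, "203.0.113.7:6881")    # the seed announces its IP address
print(dht_get(key))                 # a downloader fetches the peer list
```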

05/03/2007 P2P 67

Chord

– Consistent hashing
– A simple key lookup algorithm
– Scalable key lookup algorithm
– Node joins and stabilization
– Node failures

05/03/2007 P2P 68

Chord

– Given a key (data item), it maps the key onto a peer.
– Uses consistent hashing to assign keys to peers.
– Solves the problem of locating a key in a collection of distributed peers.
– Maintains routing information as peers join and leave the system.

05/03/2007 P2P 69

Issues

– Load balance: distributed hash function, spreading keys evenly over peers
– Decentralization: Chord is fully distributed, no node more important than another; improves robustness
– Scalability: logarithmic growth of lookup costs with the number of peers in the network; even very large systems are feasible
– Availability: Chord automatically adjusts its internal tables to ensure that the peer responsible for a key can always be found

05/03/2007 P2P 70

Example Application

– The highest layer provides a file-like interface to the user, including user-friendly naming and authentication.
– This file system maps operations to lower-level block operations.
– Block storage uses Chord to identify the node responsible for storing a block, and then talks to the block storage server on that node.

[Diagram: a client and two servers, each running a Block Store over Chord; the client additionally runs the File System layer]

05/03/2007 P2P 71

Consistent Hashing

– The consistent hash function assigns each peer and each key an m-bit identifier.
– SHA-1 is used as the base hash function.
– A peer’s identifier is defined by hashing the peer’s IP address.
– A key identifier is produced by hashing the key (Chord doesn’t define this; it depends on the application).
– ID(peer) = hash(IP, Port)
– ID(key) = hash(key)

05/03/2007 P2P 72

Consistent Hashing

– In an m-bit identifier space, there are 2^m identifiers.
– Identifiers are ordered on an identifier circle modulo 2^m. The identifier ring is called the Chord ring.
– Key k is assigned to the first peer whose identifier is equal to or follows (the identifier of) k in the identifier space.
– This peer is the successor peer of key k, denoted successor(k).
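A compact sketch of both rules (the m = 6 identifier size and the peer addresses are invented example values):

```python
import hashlib

M = 6                                   # example: 6-bit IDs, ring size 2^6 = 64

def chord_id(text, m=M):
    # SHA-1 as the base hash, reduced to the m-bit identifier space.
    return int(hashlib.sha1(text.encode()).hexdigest(), 16) % (2 ** m)

def successor(key_id, node_ids, m=M):
    # The first peer whose ID is equal to or follows key_id on the circle.
    return min(node_ids, key=lambda n: (n - key_id) % (2 ** m))

nodes = {chord_id(f"10.0.0.{i}:6346") for i in range(1, 6)}  # ID(peer) = hash(IP, Port)
k = chord_id("some-song.mp3")                                # ID(key) = hash(key)
print(f"key {k} is stored at peer {successor(k, nodes)}")
```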

05/03/2007 P2P 73

Consistent Hashing – Successor Peers

[Diagram: identifier circle (mod 2^3) with nodes 0, 1, 3 and keys 1, 2, 6]

successor(1) = 1
successor(2) = 3
successor(6) = 0

05/03/2007 P2P 75

Consistent Hashing – Join and Departure

When a node n joins the network, certain keys previously assigned to n’s successor now become assigned to n.

When node n leaves the network, all of its assigned keys are reassigned to n’s successor.

05/03/2007 P2P 76

Node Join

[Diagram: identifier circle mod 2^3 (identifiers 0–7); a joining node takes over the keys, previously held by its successor, that now map to it]

05/03/2007 P2P 77

Node Departure

[Diagram: identifier circle mod 2^3 (identifiers 0–7); a departing node’s keys are reassigned to its successor]

05/03/2007 P2P 78

Technical Issues

???

05/03/2007 P2P 80

A Simple Key Lookup

– A very small amount of routing information suffices to implement consistent hashing in a distributed environment.
– If each node knows only how to contact its current successor node on the identifier circle, all nodes can be visited in linear order.
– Queries for a given identifier can be passed around the circle via these successor pointers until they encounter the node that contains the key.

05/03/2007 P2P 81

A Simple Key Lookup

Pseudocode for finding the successor:

// ask node n to find the successor of id
n.find_successor(id)
  if (id ∈ (n, successor])
    return successor;
  else
    // forward the query around the circle
    return successor.find_successor(id);
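The same lookup as runnable Python (a sketch; the `between` helper and the three-node ring are scaffolding assumed for the example):

```python
RING = 8   # 2^3 identifiers, matching the small examples on these slides

def between(x, a, b, mod=RING):
    """True if x lies in the half-open circular interval (a, b]."""
    a, b, x = a % mod, b % mod, x % mod
    if a < b:
        return a < x <= b
    return x > a or x <= b         # the interval wraps around zero

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self      # a one-node ring points at itself

    def find_successor(self, key_id):
        if between(key_id, self.id, self.successor.id):
            return self.successor
        return self.successor.find_successor(key_id)   # walk the circle

# Ring with nodes 0, 1, 3: lookups agree with successor(1)=1, (2)=3, (6)=0.
n0, n1, n3 = Node(0), Node(1), Node(3)
n0.successor, n1.successor, n3.successor = n1, n3, n0
print(n0.find_successor(6).id)     # -> 0
```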

05/03/2007 P2P 82

A Simple Key Lookup

The path taken by a query from node 8 for key 54:
[Diagram: the query follows successor pointers around the ring until it reaches successor(54)]

05/03/2007 P2P 83

Successor

Each active node MUST know the IP address of its successor!
– N8 has to know that the next node on the ring is N14.
– Departure: N8 => N21 (when N14 leaves, N8’s successor becomes N21).
– But what about a failure or crash?

05/03/2007 P2P 84

Robustness

Keep successors within R hops:
– N8 => N14, N21, N32, N38 (R = 4)
– Periodic pinging along the path to check liveness, and also to discover “new members” in between

05/03/2007 P2P 85

Is that good enough?

05/03/2007 P2P 86

Complexity of the Search

Time/messages: O(N)
– N: # of nodes on the ring

Space: O(1)
– We only need to remember R IP addresses

Stabilization depends on the “period”.

05/03/2007 P2P 87

Scalable Key Location

To accelerate lookups, Chord maintains additional routing information.

This additional information is not essential for correctness, which is achieved as long as each node knows its correct successor.

05/03/2007 P2P 88

Scalable Key Location – Finger Tables

– Each node n maintains a routing table with up to m entries (m is the number of bits in identifiers), called the finger table.
– The i-th entry in the table at node n contains the identity of the first node s that succeeds n by at least 2^(i-1) on the identifier circle:
  s = successor(n + 2^(i-1))
– s is called the i-th finger of node n, denoted n.finger(i). (A small construction sketch follows.)
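A sketch of building finger tables from this definition (the 3-bit ring with nodes 0, 1, and 3 mirrors the figure on the next slide; `successor` implements the assignment rule from the consistent-hashing slides):

```python
M = 3                                   # 3-bit identifiers, ring size 8
nodes = [0, 1, 3]                       # the example ring

def successor(key_id, node_ids, m=M):
    return min(node_ids, key=lambda n: (n - key_id) % (2 ** m))

def finger_table(n, node_ids, m=M):
    # i-th finger = successor(n + 2^(i-1)), for i = 1..m
    return [successor((n + 2 ** (i - 1)) % (2 ** m), node_ids)
            for i in range(1, m + 1)]

for n in nodes:
    print(n, finger_table(n, nodes))
# 0 [1, 3, 0]   (starts 1, 2, 4)
# 1 [3, 3, 0]   (starts 2, 3, 5)
# 3 [0, 0, 0]   (starts 4, 5, 7)
```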

05/03/2007 P2P 89

Scalable Key Location – Finger Tables

[Diagram: identifier circle mod 2^3 with nodes 0, 1, and 3 and keys 1, 2, and 6]

Node 0 finger table (starts 0+2^0, 0+2^1, 0+2^2):
  start 1 → succ. 1
  start 2 → succ. 3
  start 4 → succ. 0
  keys: 6

Node 1 finger table (starts 1+2^0, 1+2^1, 1+2^2):
  start 2 → succ. 3
  start 3 → succ. 3
  start 5 → succ. 0
  keys: 1

Node 3 finger table (starts 3+2^0, 3+2^1, 3+2^2):
  start 4 → succ. 0
  start 5 → succ. 0
  start 7 → succ. 0
  keys: 2

05/03/2007 P2P 90

Finger Tables

A finger table entry includes both the Chord identifier and the IP address (and port number) of the relevant node.

The first finger of n is the immediate successor of n on the circle.

05/03/2007 P2P 91

Scalable Key Location – Example Query

The path of a query for key 54 starting at node 8:
[Diagram: each hop uses the largest finger that does not overshoot the target, at least halving the remaining distance]

05/03/2007 P2P 92

Scalable Key Location – A Characteristic

Since each node has finger entries at power-of-two intervals around the identifier circle, each node can forward a query at least halfway along the remaining distance between the node and the target identifier. From this intuition follows a theorem:

Theorem: With high probability, the number of nodes that must be contacted to find a successor in an N-node network is O(log N).

05/03/2007 P2P 93

Complexity of the Search

Time/messages: O(log N)
– N: # of nodes on the ring

Space: O(log N)
– We need to remember R IP addresses
– We need to remember log N fingers

Stabilization depends on the “period”.

05/03/2007 P2P 94

An Example

M = 4096 (identifier size in bits), so the ring size is 2^4096
N = 2^16 (# of nodes)
How many entries do we need in the finger table?

Recall: each node n maintains a routing table with up to m entries (the number of bits in identifiers), called the finger table. The i-th entry at node n contains the identity of the first node s that succeeds n by at least 2^(i-1) on the identifier circle:
s = successor(n + 2^(i-1))
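A quick check of the numbers (a sketch; the “distinct fingers” line relies on the standard expectation that only about log2 N fingers point at different nodes):

```python
import math

M = 4096        # identifier bits: the finger table has up to M entries
N = 2 ** 16     # number of nodes

print(M)                # 4096 entries per node
print(math.log2(N))     # ~16: with 2^16 nodes, only about log2(N) of
                        # those entries are expected to be distinct
```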

05/03/2007 P2P 95

Complexity of the Search

Time/messages: O(M)
– M: # of bits in the identifier

Space: O(M)
– We need to remember R IP addresses
– We need to remember M fingers

Stabilization depends on the “period”.

05/03/2007 P2P 96

Structured Peering

Peer identity and routability
– 2^M identifiers, finger-table routing

Key/content assignment
– Hashing

Dynamics/failures
– Inconsistency??

05/03/2007 P2P 97

Node Joins and Stabilization

– The most important thing is the successor pointer: if the successor pointer is kept up to date, which is sufficient to guarantee correctness of lookups, then the finger table can always be verified.
– Each node runs a “stabilization” protocol periodically in the background to update its successor pointer and finger table.

05/03/2007 P2P 98

Node Joins and Stabilization

The “stabilization” protocol contains 6 functions:
– create()
– join()
– stabilize()
– notify()
– fix_fingers()
– check_predecessor()

05/03/2007 P2P 99

Node Joins – join()

When node n first starts, it calls n.join(n’), where n’ is any known Chord node.

The join() function asks n’ to find the immediate successor of n.

join() does not make the rest of the network aware of n.

05/03/2007 P2P 100

Node Joins – join()

// create a new Chord ring.
n.create()
  predecessor = nil;
  successor = n;

// join a Chord ring containing node n’.
n.join(n’)
  predecessor = nil;
  successor = n’.find_successor(n);

05/03/2007 P2P 102

Node Joins – stabilize()

– Each time node n runs stabilize(), it asks its successor for the successor’s predecessor p, and decides whether p should be n’s successor instead.
– stabilize() notifies node n’s successor of n’s existence, giving the successor the chance to change its predecessor to n.
– The successor does this only if it knows of no closer predecessor than n.

05/03/2007 P2P 103

Node Joins – stabilize()

// called periodically. verifies n’s immediate
// successor, and tells the successor about n.
n.stabilize()
  x = successor.predecessor;
  if (x ∈ (n, successor))
    successor = x;
  successor.notify(n);

// n’ thinks it might be our predecessor.
n.notify(n’)
  if (predecessor is nil or n’ ∈ (predecessor, n))
    predecessor = n’;

05/03/2007 P2P 104

Node Joins – Join and Stabilization

[Diagram: nodes np, n, ns on the ring; initially succ(np) = ns and pred(ns) = np; after n joins and stabilization runs, succ(np) = n and pred(ns) = n]

n joins
– predecessor = nil
– n acquires ns as its successor via some n’

n runs stabilize
– n notifies ns that it is the new predecessor
– ns acquires n as its predecessor

np runs stabilize
– np asks ns for its predecessor (now n)
– np acquires n as its successor
– np notifies n
– n will acquire np as its predecessor

All predecessor and successor pointers are now correct.
Fingers still need to be fixed, but old fingers will still work.

05/03/2007 P2P 105

Node Joins – fix_fingers()

– Each node periodically calls fix_fingers to make sure its finger table entries are correct.
– It is how new nodes initialize their finger tables.
– It is how existing nodes incorporate new nodes into their finger tables.

05/03/2007 P2P 106

Node Joins – fix_fingers()

// called periodically. refreshes finger table entries.
n.fix_fingers()
  next = next + 1;
  if (next > m)
    next = 1;
  finger[next] = find_successor(n + 2^(next-1));

// checks whether predecessor has failed.
n.check_predecessor()
  if (predecessor has failed)
    predecessor = nil;

05/03/2007 P2P 108

Node Failures

– A key step in failure recovery is maintaining correct successor pointers.
– To help achieve this, each node maintains a successor list of its r nearest successors on the ring.
– If node n notices that its successor has failed, it replaces it with the first live entry in the list.

Successor lists are stabilized as follows (see the sketch below):
– Node n reconciles its list with its successor s by copying s’s successor list, removing its last entry, and prepending s to it.
– If node n notices that its successor has failed, it replaces it with the first live entry in its successor list and reconciles its successor list with its new successor.
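A small sketch of those two successor-list rules (hypothetical node IDs; `r` is the list length):

```python
def reconcile(succ, succ_list, r):
    """Copy the successor's list, drop its last entry, prepend the
    successor itself, and keep at most r entries."""
    return ([succ] + succ_list[:-1])[:r]

def repair(succ_list, alive):
    """On successor failure, promote the first live entry."""
    for n in succ_list:
        if n in alive:
            return n
    return None

# N8 keeps r = 4 successors: N14, N21, N32, N38 (the earlier example).
print(repair([14, 21, 32, 38], alive={21, 32, 38}))   # N14 failed -> 21
print(reconcile(21, [32, 38, 42, 48], r=4))           # -> [21, 32, 38, 42]
```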

05/03/2007 P2P 109

Chord – The Math

– Every node is responsible for about K/N keys (N nodes, K keys).
– When a node joins or leaves an N-node network, only O(K/N) keys change hands (and only to and from the joining or leaving node).
– Lookups need O(log N) messages.
– To reestablish routing invariants and finger tables after a node joins or leaves, only O(log^2 N) messages are required.