
04/27/2011 DHT 1

ecs251 Spring 2011: Operating System, #5: Distributed Hash Table

Dr. S. Felix Wu

Computer Science Department

University of California, Davis

http://www.facebook.com/group.php?gid=29670204725

http://cyrus.cs.ucdavis.edu/~wu/ecs251


04/27/2011 DHT 2

GFS: Google File System

"Failures" are the norm
Multiple-GB files are common
Append rather than overwrite
– Random writes are rare
Can we relax the consistency?


04/27/2011 DHT 3

• Client translates file name and byte offset to a chunk index.
• Sends request to master.
• Master replies with chunk handle and location of replicas.
• Client caches this info.
• Sends request to a close replica, specifying chunk handle and byte range.
• Requests to master are typically buffered.


04/27/2011 DHT 4

The Master

Maintains all file system metadata:
– namespace, access control info, file-to-chunk mappings, chunk (including replica) locations, etc.

Periodically communicates with chunkservers via HeartBeat messages to give instructions and check their state.


04/27/2011 DHT 5

1. Client asks master for all replicas.
2. Master replies. Client caches.
3. Client pre-pushes data to all replicas.
4. After all replicas acknowledge, client sends write request to the primary.
5. Primary forwards the write request to all replicas.
6. Secondaries signal completion.
7. Primary replies to client. Errors are handled by retrying.
(A sketch of this flow follows.)
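A minimal sketch of the write flow above in Python-style code. The objects and method names (master.lookup_replicas, replica.push_data, primary.write) are hypothetical stand-ins for illustration, not the real GFS RPC interface.

def gfs_write(master, filename, chunk_index, data, max_retries=3):
    for _ in range(max_retries):
        # Steps 1-2: get (and cache) the primary and secondary replica locations.
        primary, secondaries = master.lookup_replicas(filename, chunk_index)
        # Step 3: pre-push the data to every replica (buffered, not yet applied).
        for replica in [primary] + secondaries:
            replica.push_data(data)
        # Steps 4-5: once all replicas hold the data, the primary applies the write
        # in a serial order of its choosing and forwards that order to the secondaries.
        # Steps 6-7: the primary replies only after the secondaries signal completion.
        if primary.write(chunk_index, data, forward_to=secondaries):
            return True
    return False   # errors are handled by retrying; give up after max_retries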


System Interactions

The master grants a chunk lease to one replica (the primary).
The replica holding the lease determines the order of updates to all replicas.
Lease:
– 60-second timeouts
– Can be extended indefinitely
– Extension requests are piggybacked on HeartBeat messages
– After a timeout expires, the master can grant new leases

04/27/2011 DHT 6


Snapshot

A "snapshot" is a copy of a system at a moment in time.
– When are snapshots useful?
– Does "cp -r" generate snapshots?

Handled using copy-on-write (COW):
– First revoke all leases.
– Then duplicate the metadata, but point to the same chunks.
– When a client requests a write, the master allocates a new chunk handle.

04/27/2011 DHT 7


04/27/2011 DHT 8

HDFS Architecture

[Figure: the client (1) sends a filename to the NameNode, (2) receives block IDs and DataNode locations, and (3) reads the data directly from the DataNodes; a SecondaryNameNode sits alongside the NameNode, and cluster membership is tracked across the DataNodes.]

NameNode: maps a file to a file-id and a list of DataNodes
DataNode: maps a block-id to a physical location on disk
SecondaryNameNode: periodic merge of the transaction log


04/27/2011 DHT 9

Structured Peering

Peer identity and routability
Key/content assignment
– Which identity owns what?
  GFS/Napster: centralized index service
  Skype/Kazaa: login server & super peers
  DNS: hierarchical DNS servers

Two problems:
(1) How to connect to the "topology"?
(2) How to prevent failures/changes?


04/27/2011 DHT 10

DHT

Most structured P2P systems are DHT-based.
Distributed hash tables (DHTs):
– a decentralized lookup service of a hash table
– (name, value) pairs stored in the DHT
– any peer can efficiently retrieve the value associated with a given name
– the mapping from names to values is distributed among peers
(A minimal put/get sketch follows.)
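To make the (name, value) idea concrete, here is a tiny, self-contained put/get sketch. It keeps all the "peers" (buckets) in one process; the whole point of a real DHT is that the name-to-owner mapping is spread across many machines.

import hashlib

class TinyDHT:
    def __init__(self, num_buckets=16):
        self.buckets = [dict() for _ in range(num_buckets)]

    def _owner(self, name):
        # hash the name to decide which "peer" (bucket) owns it
        h = int(hashlib.sha1(name.encode()).hexdigest(), 16)
        return self.buckets[h % len(self.buckets)]

    def put(self, name, value):
        self._owner(name)[name] = value

    def get(self, name):
        return self._owner(name).get(name)

dht = TinyDHT()
dht.put("freebsd-5.4.iso", "169.237.6.102")   # made-up name and address
print(dht.get("freebsd-5.4.iso"))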


04/27/2011 DHT 11

HT as a search table (BitTorrent, Napster)

Index key: information/content is distributed, and we need to know where.

Where is this GFS chunk?
Where is this piece of music?
Is this BT piece available?
What is the location of this type of content?
What is the current IP address of this Skype user?

Content object / peer naming: "160 bits"


04/27/2011 DHT 12

DHT as a search table

Index key → ???


04/27/2011 DHT 13

DHT as a search table

Index key → ???


04/27/2011 DHT 14

DHT segment ownership

Index key → ???


04/27/2011 DHT 15

DHT

Scalable
Peer arrivals, departures, and failures
Unstructured versus structured


04/27/2011 DHT 16

DHT (Name, Value)

How can we utilize a DHT to avoid trackers in BitTorrent?


04/27/2011 DHT 17

DHT-based Tracker

Whoever owns this hash entry is the tracker for the corresponding key!
– Index key: e.g., "FreeBSD 5.4 CD images" (publish the key on the class web site)
– Value: the seed's IP address
– Operations: PUT & GET


04/27/2011 DHT 18

Chord

Given a key (content object), it maps the key onto a peer -- consistent hashing.
Assigns keys to peers.
Solves the problem of locating a key in a collection of distributed peers.
Maintains routing information as peers join and leave the system.


04/27/2011 DHT 19

Chord

Consistent Hashing
A Simple Key Lookup Algorithm
Scalable Key Lookup Algorithm
Node Joins and Stabilization
Node Failures


04/27/2011 DHT 20

Consistent Hashing

A consistent hash function assigns each peer and key an m-bit identifier (e.g., 140 bits), with SHA-1 as a base hash function.
A peer's identifier is defined by hashing the peer's IP address (other possibilities?).
A content identifier is produced by hashing the key:
– ID(peer) = SHA-1(IP, Port)
– ID(content) = SHA-1(related to the content object)
– Application-dependent!
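For concreteness, a small sketch of this identifier assignment with SHA-1; the inputs shown are made-up examples, and real systems pick m and the hashed fields per application.

import hashlib

def chord_id(data: bytes, m: int = 160) -> int:
    # m-bit identifier derived from SHA-1
    return int(hashlib.sha1(data).hexdigest(), 16) % (2 ** m)

peer_id    = chord_id(b"169.237.6.102:6881")      # ID(peer)    = SHA-1(IP, Port)
content_id = chord_id(b"freebsd-5.4-disc1.iso")   # ID(content) = SHA-1(content name)
print(peer_id, content_id)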


04/27/2011 DHT 21

Peer, Content

In an m-bit identifier space, there are 2^m identifiers (for both peers and content).

Which peer handles which content?


04/27/2011 DHT 22

Peer, Content

In an m-bit identifier space, there are 2^m identifiers (for both peers and content).
Which peer handles which content?
– We will not have 2^m peers/contents!
– Each peer might need to handle more than one content item.
– In that case, which peer has what?


04/27/2011 DHT 23

Consistent Hashing

In an m-bit identifier space, there are 2^m identifiers, arranged on an identifier circle modulo 2^m. The identifier ring is called the Chord ring.
Content X is assigned to the first peer whose identifier is equal to or follows (the identifier of) X in the identifier space.
This peer is the successor peer of key X, denoted successor(X).


04/27/2011 DHT 24

Successor Peers

[Figure: identifier circle with m = 3 — identifiers 0..7, nodes 0, 1, and 3, and keys 1, 2, and 6.]

successor(1) = 1
successor(2) = 3
successor(6) = 0


04/27/2011 DHT 26

Join and Departure

When a node N joins the network, certain content previously assigned to N's successor now becomes assigned to N.

When node N leaves the network, all of its assigned content is reassigned to N's successor.


04/27/2011 DHT 27

Join

[Figure: a node joins the Chord ring; the keys between the new node and its predecessor move from the successor to the new node, while the other keys stay put.]


04/27/2011 DHT 28

Departure

[Figure: a node departs the Chord ring; all keys it held are reassigned to its successor.]


04/27/2011 DHT 29

Join/Depart

What information must be maintained?


04/27/2011 DHT 30

Join/Depart

What information must be maintained?
– Pointer to successor(s)
– The content itself (but application-dependent)


04/27/2011 DHT 31

Tracker gone?

Whoever owns this hash entry is the tracker for the corresponding key!
– Index key: e.g., "FreeBSD 5.4 CD images" (publish the key on the class web site)
– Value: the seed's IP address
– Operations: PUT & GET


04/27/2011 DHT 32

How to identify the tracker?

And, its IP address, of course?


04/27/2011 DHT 33

A Simple Key Lookup

A very small amount of routing information suffices to implement consistent hashing in a distributed environment.

If each node knows only how to contact its current successor node on the identifier circle, all nodes can be visited in linear order.

Queries for a given identifier can be passed around the circle via these successor pointers until they encounter the node that contains the key.


04/27/2011 DHT 34

A Simple Key Lookup

Pseudocode for finding a successor:

// ask node N to find the successor of id
N.find_successor(id)
  if (id ∈ (N, successor])
    return successor;
  else
    // forward the query around the circle
    return successor.find_successor(id);
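The same lookup rendered as runnable Python for the 3-bit example ring (nodes 0, 1, 3). This only illustrates the pseudocode; it is not Chord's networking.

def in_interval(x, a, b, m=3):
    # is x in the half-open interval (a, b] on the ring mod 2^m ?
    size = 2 ** m
    x, a, b = x % size, a % size, b % size
    if a < b:
        return a < x <= b
    return x > a or x <= b          # the interval wraps around 0

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = None

    def find_successor(self, key):
        if in_interval(key, self.id, self.successor.id):
            return self.successor
        return self.successor.find_successor(key)   # forward around the circle

n0, n1, n3 = Node(0), Node(1), Node(3)
n0.successor, n1.successor, n3.successor = n1, n3, n0
print(n1.find_successor(6).id)   # key 6 -> node 0 (successor(6) = 0)
print(n0.find_successor(2).id)   # key 2 -> node 3 (successor(2) = 3)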


04/27/2011 DHT 35

A Simple Key Lookup

The path taken by a query from node 8 for key 54:


04/27/2011 DHT 36

Successor

Each active node MUST know the IP address of its successor!
– N8 has to know that the next node on the ring is N14.
Departure: N8 => N21 (if N14 leaves, N8's successor becomes N21).
But how about a failure or crash?


04/27/2011 DHT 37

Robustness

Keep successors up to R hops ahead:
– N8 => N14, N21, N32, N38 (R = 4)
– Periodic pinging along the path to check, and also to find out whether there are "new members" in between.


04/27/2011 DHT 38

Is that good enough?


04/27/2011 DHT 39

Without Periodic Ping…??

Triggered only by dynamics (Join/Depart)!


04/27/2011 DHT 40

Complexity of the search

Time/messages: O(N)
– N: number of nodes on the ring
Space: O(1)
– We only need to remember R IP addresses.
Stabilization depends on the "period".


04/27/2011 DHT 41

Scalable Key Location

To accelerate lookups, Chord maintains additional routing information.

This additional information is not essential for correctness, which is achieved as long as each node knows its correct successor.


04/27/2011 DHT 42

Finger Tables

Each node N maintains a routing table with up to m entries (m is the number of bits in identifiers), called the finger table.

The i-th entry in the table at node N contains the identity of the first node s that succeeds N by at least 2^(i-1) on the identifier circle:

s = successor(N + 2^(i-1))

s is called the i-th finger of node N, denoted N.finger(i). (A small construction sketch follows.)
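A small sketch of how the finger entries are computed for the m = 3 example ring; successor_of is a local stand-in for a real lookup.

def successor_of(key, nodes, m=3):
    # first node clockwise from key on the ring
    ring = sorted(nodes)
    key %= 2 ** m
    for n in ring:
        if n >= key:
            return n
    return ring[0]                   # wrap around

def finger_table(n, nodes, m=3):
    # finger[i] = successor(n + 2^(i-1)), i = 1..m
    return [successor_of(n + 2 ** (i - 1), nodes, m) for i in range(1, m + 1)]

print(finger_table(0, [0, 1, 3]))    # starts 1, 2, 4 -> successors [1, 3, 0]
print(finger_table(1, [0, 1, 3]))    # starts 2, 3, 5 -> successors [3, 3, 0]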


04/27/2011 DHT 43

Finger Tables

[Figure: finger tables for the m = 3 example ring with nodes 0, 1, and 3. Each entry i lists start = n + 2^(i-1) and its successor: node 0 has starts 1, 2, 4 with successors 1, 3, 0; node 1 has starts 2, 3, 5 with successors 3, 3, 0; node 3 has starts 4, 5, 7 with successors 0, 0, 0. Keys 1, 2, and 6 are stored at nodes 1, 3, and 0 respectively.]

s = successor(n + 2^(i-1))


04/27/2011 DHT 44

Finger Tables

A finger table entry includes both the Chord identifier and the IP address (and port number) of the relevant node.

The first finger of N is the immediate successor of N on the circle.


04/27/2011 DHT 45

Example query

The path of a query for key 54 starting at node 8:


Kademlia routing

04/27/2011 DHT 46


04/27/2011 DHT 47

Scalable Key Location

Since each node has finger entries at power-of-two intervals around the identifier circle, each node can forward a query at least halfway along the remaining distance between the node and the target identifier. From this intuition follows a theorem:

Theorem: With high probability, the number of nodes that must be contacted to find a successor in an N-node network is O(log N).
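A sketch of the halving step, reusing in_interval and the toy Node class from the earlier sketches and assuming each node also carries node.fingers, a list of Node references for fingers 1..m. It loosely mirrors the published pseudocode and is for illustration only.

def closest_preceding_finger(node, key, m=3):
    # scan fingers from the farthest reach back toward the node itself
    for f in reversed(node.fingers):
        # pick a finger strictly between node and key on the ring
        if in_interval(f.id, node.id, key, m) and f.id != key % (2 ** m):
            return f
    return node

def find_successor_fast(node, key, m=3):
    if in_interval(key, node.id, node.successor.id, m):
        return node.successor
    nxt = closest_preceding_finger(node, key, m)
    if nxt is node:                      # no closer finger known; fall back to linear step
        return node.successor.find_successor(key)
    return find_successor_fast(nxt, key, m)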


04/27/2011 DHT 48

Complexity of the Search

Time/messages: O(log N)
– N: number of nodes on the ring
Space: O(log N)
– We need to remember R IP addresses.
– We need to remember log N fingers.
Stabilization depends on the "period".


04/27/2011 DHT 49

An Example

M = 140 (identifier size), so the ring size is 2^140.
N = 2^16 (number of nodes).
How many entries do we need in the finger table?

Each node n maintains a routing table with up to M entries (M is the number of bits in identifiers), called the finger table. The i-th entry in the table at node n contains the identity of the first node s that succeeds n by at least 2^(i-1) on the identifier circle:

s = successor(n + 2^(i-1))


04/27/2011 DHT 50

Complexity of the Search

Time/messages: O(M)
– M: number of bits in the identifier
Space: O(M)
– We need to remember R IP addresses.
– We need to remember M fingers.
Stabilization depends on the "period".


04/27/2011 DHT 51

Structured Peering

Peer identity and routability
– 2^M identifiers, finger-table routing
Key/content assignment
– Hashing
Dynamics/Failures
– Inconsistency??


04/27/2011 DHT 52

Joins and Stabilizations

The most important thing is the successor pointer.
If the successor pointer is kept up to date, which is sufficient to guarantee correctness of lookups, then the finger tables can always be verified and corrected.
Each node runs a "stabilization" protocol periodically in the background to update its successor pointer and finger table.


04/27/2011 DHT 53

Node Joins – stabilize()

Each time node N runs stabilize(), it asks its successor for the successor's predecessor p, and decides whether p should be N's successor instead.

stabilize() also notifies node N's successor of N's existence, giving the successor the chance to change its predecessor to N.

The successor does this only if it knows of no closer predecessor than N.


04/27/2011 DHT 54

Node Joins – stabilize()

// called periodically. verifies N's immediate
// successor, and tells the successor about N.
N.stabilize()
  x = successor.predecessor;
  if (x ∈ (N, successor))
    successor = x;
  successor.notify(N);

// N' thinks it might be our predecessor.
N.notify(N')
  if (predecessor is nil or N' ∈ (predecessor, N))
    predecessor = N';
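The same two routines as a runnable sketch, assuming the toy Node objects from the earlier sketches also carry a predecessor field; between_open is the strictly open ring interval used here.

def between_open(x, a, b, m=3):
    # is x strictly between a and b on the ring mod 2^m ?
    size = 2 ** m
    x, a, b = x % size, a % size, b % size
    if a < b:
        return a < x < b
    return x > a or x < b

def stabilize(n, m=3):
    x = n.successor.predecessor
    if x is not None and between_open(x.id, n.id, n.successor.id, m):
        n.successor = x
    notify(n.successor, n, m)

def notify(n, candidate, m=3):
    # candidate thinks it might be n's predecessor
    if n.predecessor is None or between_open(candidate.id, n.predecessor.id, n.id, m):
        n.predecessor = candidate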


04/27/2011 DHT 55

Stabilization

[Figure: node n joins between np and ns; initially succ(np) = ns and pred(ns) = np, and the steps below establish succ(np) = n and pred(ns) = n.]

n joins
– predecessor = nil
– n acquires ns as successor via some n'
n runs stabilize
– n notifies ns that n may be its new predecessor
– ns acquires n as its predecessor
np runs stabilize
– np asks ns for its predecessor (now n)
– np acquires n as its successor
– np notifies n
– n acquires np as its predecessor

All predecessor and successor pointers are now correct.
Fingers still need to be fixed, but old fingers will still work.


04/27/2011 DHT 56

fix_fingers()

Each node periodically calls fix_fingers() to make sure its finger table entries are correct.

It is how new nodes initialize their finger tables.

It is how existing nodes incorporate new nodes into their finger tables.


04/27/2011 DHT 57

Node Joins – fix_fingers()

// called periodically. refreshes finger table entries.
N.fix_fingers()
  next = next + 1;
  if (next > m)
    next = 1;
  finger[next] = find_successor(N + 2^(next-1));

// checks whether predecessor has failed.
N.check_predecessor()
  if (predecessor has failed)
    predecessor = nil;
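A matching sketch of the maintenance calls on the same toy Node; is_alive() is a stand-in for whatever ping/RPC failure detection a real deployment would use.

def fix_fingers(n, m=3):
    # cycle "next" through 1..m and refresh one finger per call
    n.next = getattr(n, "next", 0) + 1
    if n.next > m:
        n.next = 1
    n.fingers[n.next - 1] = n.find_successor((n.id + 2 ** (n.next - 1)) % (2 ** m))

def check_predecessor(n, is_alive):
    # is_alive is an assumed callable that pings a node and reports liveness
    if n.predecessor is not None and not is_alive(n.predecessor):
        n.predecessor = None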


04/27/2011 DHT 59

Node Failures

The key step in failure recovery is maintaining correct successor pointers.

To help achieve this, each node maintains a successor list of its r nearest successors on the ring.

If node n notices that its successor has failed, it replaces it with the first live entry in the list.

Successor lists are stabilized as follows:
– node n reconciles its list with its successor s by copying s's successor list, removing its last entry, and prepending s to it.
– If node n notices that its successor has failed, it replaces it with the first live entry in its successor list and reconciles its successor list with its new successor.


04/27/2011 DHT 60

Chord – The Math

Every node is responsible for about K/N keys (N nodes, K keys).

When a node joins or leaves an N-node network, only O(K/N) keys change hands (and only to and from the joining or leaving node).

Lookups need O(log N) messages.

To re-establish routing invariants and finger tables after a node joins or leaves, only O(log^2 N) messages are required.


Structural Search

Distributed, P2P
Attributes about the nodes
Nodes are connected via some structure (ring, grid, or hypergraph)

Objective: Where is X?
– X could be some content or a node identity

04/27/2011 DHT 61


10/26/2009 Davis Social Links 62

Kleinberg's Basic Setting


10/26/2009 Davis Social Links 63

p, q, r

p: lattice distance within which a node links to all of its local neighbors
q: number of long-range contacts
r: long-range contacts are chosen with probability proportional to [d(u,v)]^(-r)
– What is the intuition about r?
– What about r = 0?


10/26/2009 Davis Social Links 64

Kleinberg's results

A decentralized routing/search problem:
– For nodes s, t with known lattice coordinates, find a short path from s to t.
– At any step, the algorithm can only use local information.
– Kleinberg suggests a simple greedy algorithm and analyzes it.


10/26/2009 Davis Social Links 65

Local Information

Local contacts
Coordinates of the target
The locations and long-range contacts of all nodes that have come in contact with the message


10/26/2009 Davis Social Links 66

Results

If r = 0, the expected delivery time is at least a_0 · n^(2/3).
– Lower bound

If r = 2, p = q = 1: a_2 · (log n)^2
– Martel/Nguyen's newer results

0 <= r < 2: ~ a_r · n^((2-r)/3)
r > 2: ~ a_r · n^((r-2)/(r-1))


10/26/2009 Davis Social Links 67

The Web


Social Network Analysis

“Structural relationships” as explanations:

• Network

• Formation

• Influence and collective actions


10/26/2009 Davis Social Links 69

Social Network Analysis

1. Degree Centrality: the number of direct connections a node has. What really matters is where those connections lead and how they connect the otherwise unconnected.

2. Betweenness Centrality: a node with high betweenness has great influence over what flows in the network, indicating important links and single points of failure.

3. Closeness Centrality: measures how close a node is to everyone else. The pattern of direct and indirect ties allows such a node to reach any other node in the network more quickly than anyone else; it has the shortest paths to all others.

4. Eigenvector Centrality: assigns relative scores to all nodes in the network based on the principle that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes.
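These four measures are easy to experiment with; for example, using the networkx library (not part of the slides) on a small classic social graph:

import networkx as nx

G = nx.karate_club_graph()                    # a small, well-known social network
deg = nx.degree_centrality(G)
btw = nx.betweenness_centrality(G)
clo = nx.closeness_centrality(G)
eig = nx.eigenvector_centrality(G)

top = lambda d: max(d, key=d.get)
print(top(deg), top(btw), top(clo), top(eig))  # most central node under each measure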


10/26/2009 Davis Social Links 70

Small World Model

Low diameter
– Logarithmic or poly-logarithmic in N

"High" clustering coefficient
– clustering coefficient: the fraction of X's neighbors that directly connect to one of X's other neighbors


10/26/2009 Davis Social Links 71

Cluster Coefficient

Mesh (fully connected) network: C_cluster = 1

Lattice network (with degree K): C_cluster = 0
– E.g., a linear line
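The two extremes can be checked quickly with networkx (again just an illustration, not from the slides): a complete graph has clustering 1, a simple path has clustering 0.

import networkx as nx

print(nx.average_clustering(nx.complete_graph(10)))   # 1.0
print(nx.average_clustering(nx.path_graph(10)))       # 0.0 (a "linear line")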


10/26/2009 Davis Social Links 72

Re-wiring (Watts/Strogatz)

Trade-off between D and C_cluster!

Structured/Clustered


10/26/2009 Davis Social Links 73

Two Issues about Low Diameters

Why should there exist short chains of acquaintances linking together arbitrary pairs of strangers?

Why should arbitrary pairs of strangers be able to find the short chains of acquaintances that link them together?


10/26/2009 Davis Social Links 74

Some Extensions

Hierarchical Network Models
Group Structure Models
Constant Number of Out-Links

"Small World Phenomena and the Dynamics of Information" by J. Kleinberg, NIPS, 2001


10/26/2009 Davis Social Links 75

Generation & SearchGeneration & Search

There is a data structure behind and among all the social peers– Lattice, Tree, Group/Community

The link probability depends on this “social data structure”– And, using it to generate the social network

Searching may use “direct contacts” plus the knowledge about the social data structure


10/26/2009 Davis Social Links 76

Hierarchical Network Models

Representation
– a complete b-ary tree T
– all social nodes are "leaves"

Distance and link probability
– h(v,w) = the height of the least common ancestor of v and w in T
– link probability proportional to f(h(v,w))
– normalized: Pr[v → w] = f(h(v,w)) / Σ_{x≠v} f(h(v,x))
– out-degree in the graph: k = c·log^2 n


10/26/2009 Davis Social Links 77

the Critical Value

f has critical exponent α when:

lim_{h→∞} f(h) / b^(-α'h) = 0, for all α' < α
lim_{h→∞} b^(-α''h) / f(h) = 0, for all α'' > α

e.g., f(h(v,w)) ~ b^(-α·h(v,w))


10/26/2009 Davis Social Links 78

Interpretation (1)

/Science/Computer_Science/Algorithms
/Arts/Music/Opera
/Science/Computer_Science/Machine_Learning


10/26/2009 Davis Social Links 79

Interpretation (2)

Target: "stock broker @ Boston, MA"

Next hop:
– "bishop @ Cambridge, MA"
– "banker @ New York City, NY"


10/26/2009 Davis Social Links 80

Results

α = 1 ⇒ O(log n)

Otherwise, no polylogarithmic search.


10/26/2009 Davis Social Links 81

How to Search in HNM??

f(h(v,w)) ~ b^(-h(v,w))

Pr[v → w] = f(h(v,w)) / Σ_{x≠v} f(h(v,x))

h(v,w): height of the least common ancestor of v and w

k = c·log^2 n out-links per node


10/26/2009 Davis Social Links 82

Useful Neighbor

v → t, with v, t ∈ T
commonAncestor(v, t) = u
T': the subtree of height i rooted at u (u ∈ T', root(T') = u)
T'': the subtree of height (i-1) that contains t (t ∈ T'', v ∉ T'')

Is "v" useful to reach "t"?

[Figure: tree T with leaves v and t.]


10/26/2009 Davis Social Links 83

Useful Neighbor

v → t, with v, t ∈ T
commonAncestor(v, t) = u
T': the subtree of height i rooted at u (u ∈ T', root(T') = u)
T'': the subtree of height (i-1) that contains t (t ∈ T'', v ∉ T'')

Is "v" useful to reach "t"?

[Figure: tree T showing v, t, their common ancestor u, and the subtree T'.]


10/26/2009 Davis Social Links 84

Useful Neighbor

v → t, with v, t ∈ T
commonAncestor(v, t) = u
T': the subtree of height i rooted at u (u ∈ T', root(T') = u)
T'': the subtree of height (i-1) that contains t (t ∈ T'', v ∉ T'')

Is "v" useful to reach "t"?

[Figure: as above, now also showing the subtree T'' containing t and a neighbor w of v inside T''.]


10/26/2009 Davis Social Links 85

Useful Neighbor

v → t, with v, t ∈ T
commonAncestor(v, t) = u
T': the subtree of height i rooted at u (u ∈ T', root(T') = u)
T'': the subtree of height (i-1) that contains t (t ∈ T'', v ∉ T'')

Is "v" useful to reach "t"?

[Figure: as above, with the neighbor w of v lying inside T''.]


10/26/2009 Davis Social Links 86

Useful Neighbor Recursively

v → t, with v, t ∈ T
commonAncestor(v, t) = u
T': the subtree of height i rooted at u (u ∈ T', root(T') = u)
T'': the subtree of height (i-1) that contains t (t ∈ T'', v ∉ T'')

Is "v" useful to reach "t"?

[Figure: once the search reaches w inside T'', the same question is asked again one level down, with w playing the role of v.]


10/26/2009 Davis Social Links 87

Search

Find one "useful" neighbor in G as the next step.
What happens if there is NO useful neighbor?
Expected number of steps to reach "t".


10/26/2009 Davis Social Links 88

Probability to have 1 U.N.

Normalization over all out-link targets:
Z = Σ_{x≠v} b^(-h(v,x)) = Σ_{j=1..log_b n} (b-1)·b^(j-1)·b^(-j) ≤ log n

One leaf: T'' contains b^(i-1) leaves, each chosen by a given out-link with probability b^(-i)/Z ≥ b^(-i)/log n, so a single out-link lands in T'' with probability at least
b^(i-1) · b^(-i) / log n = 1/(b·log n)

All out-links: with k = c·log^2 n out-links, the probability that none is useful is
(1 - 1/(b·log n))^(c·log^2 n) ≤ n^(-θ)


10/26/2009 Davis Social Links 89

HNM

High probability to be useful
How about "constant links"?


10/26/2009 Davis Social Links 90

Group Structures

R is a group; R' is a strictly smaller subgroup.
q(v,w): minimum size of a group containing both v and w.

Property (1): q = |R| ≥ 2, v ∈ R ⇒ there is a group R' ⊆ R with v ∈ R' and q = |R| > |R'| > λq

Property (2): if R1, R2, R3, … all contain v and each |Ri| ≤ q, then |∪_i Ri| ≤ βq


10/26/2009 Davis Social Links 91

How to Search in Group Structure??

f(q(v,w)) ~ q(v,w)^(-α)

Pr[v → w] = f(q(v,w)) / Σ_{x≠v} f(q(v,x))

q(v,w): minimum size of a group containing both v and w

k = c·log^2 n out-links per node


10/26/2009 Davis Social Links 92

Idea

For (v, t), let R be the minimum-sized group containing both v and t. With property (1),

q = |R| ≥ 2, v ∈ R ⇒ there is R' ⊆ R with v ∈ R' and q = |R| > |R'| > λq,

then:

∃ R' such that t ∈ R' and λ^2·|R| < |R'| < λ·|R|

How to define the "usefulness" of v?


10/26/2009 Davis Social Links 93

Usefulness of v

For (v, t), let R be the minimum-sized group containing both v and t. With property (1),

q = |R| ≥ 2, v ∈ R ⇒ there is R' ⊆ R with v ∈ R' and q = |R| > |R'| > λq,

then:

∃ R' such that t ∈ R' and λ^2·|R| < |R'| < λ·|R|

v is useful if ∃ x with l(v,x) = 1 and x ∈ R'  (i.e., v has a direct link into R')


10/26/2009 Davis Social Links 94

Probability to have 1 U.N.

Normalization over all out-link targets:
Z = Σ_{x≠v} b^(-h(v,x)) = Σ_{j=1..log_b n} (b-1)·b^(j-1)·b^(-j) ≤ log n

One leaf: T'' contains b^(i-1) leaves, each chosen by a given out-link with probability b^(-i)/Z ≥ b^(-i)/log n, so a single out-link lands in T'' with probability at least
b^(i-1) · b^(-i) / log n = 1/(b·log n)

All out-links: with k = c·log^2 n out-links, the probability that none is useful is
(1 - 1/(b·log n))^(c·log^2 n) ≤ n^(-θ)


10/26/2009 Davis Social Links 95

Probability to have 1 U.N.

Z = Σ_{x≠v} 1/q(v,x) ≤ Σ_{j=1..log_β n} β^(j+1) · β^(-(j-1)) = β^2 · log_β n

(1 - λ^2/(β^2 · log_β n))^(c·log^2 n) ≤ n^(-θ)


10/26/2009 Davis Social Links 96

Results

α = 1 ⇒ O(log n)

Otherwise, no polylogarithmic search.


10/26/2009 Davis Social Links 97

Fixed Number of Out-Links

Relax "t" to "a cluster of t".

[Figure: the tree T with its leaves grouped into clusters Cl; v (with w) searches for the cluster containing t (with x).]

m = |L|
r = |Cluster|  (r: resolution)
n = m × r


10/26/2009 Davis Social Links 98

Question #1

Why can't we just treat a "Cluster" as a "Super Node" and go home (by applying the HNM results)?

[Figure: the same clustered tree, with m = |L|, r = |Cluster|, n = m × r.]


10/26/2009 Davis Social Links 99

Not necessarily

[Figure: clusters containing {v, w}, {t, x}, and {p, q}.]


10/26/2009 Davis Social Links 100

Probability

f(h(v,w)) ~ (h(v,w)+1)^(-2) · b^(-h(v,w))

Z ≤ 2r


10/26/2009 Davis Social Links 101

Question #2

For any out-link of v, what is the probability that the end point of the out-link is in the same cluster as v?


10/26/2009 Davis Social Links 102

Answer

Within v's own cluster h = 0, so f = (0+1)^(-2) · b^(-0) = 1. With r nodes per cluster:

(1 × r) / Z ≥ r / (2r) = 1/2


10/26/2009 Davis Social Links 103

Results

If the resolution is polylogarithmic, then the search is polylogarithmic when α = 1.


10/26/2009 Davis Social Links 104

A "Similar" Process

[Figure: the tree T with subtrees T' and T'' and nodes v, u, w, t, as in the recursive useful-neighbor search.]

Coloring the links


10/26/2009 Davis Social Links 105

Reading

“Small World Phenomena and the Dynamics of Information” by J. Kleinberg, NIPS, 2001


10/23/2007 P2P 106


10/23/2007 P2P 107

File Organization

Piece: 256 KB
Block: 16 KB

[Figure: a file split into pieces 1-4; one piece is still incomplete (only some of its blocks have arrived).]


10/23/2007 P2P 108

Initialization

[Figure: BitTorrent bootstrap — the user fetches MYFILE.torrent from a web server (HTTP GET); the .torrent names the tracker, e.g. http://mytracker.com:6969/S3F5YHG6FEBFG5467HGF367F456JI9N5FF4E…; the client "registers" with the tracker and receives a list of peers such as ID1 169.237.234.1:6881, ID2 190.50.34.6:5692, ID3 34.275.89.143:4545, …, ID50 231.456.31.95:6882; it then connects to peers (Peer 1, Peer 2, …, Peer 40).]


10/23/2007 P2P 109

Peer/Seed

[Figure: a seed and peers exchanging pieces 1-4 of the file.]


10/23/2007 P2P 110

"On the Wire" Protocol

(Over TCP)

Local peer <-> remote peer:
– Handshake (ID / Infohash)
– BitField exchanged in both directions
– Initial state on each side: interested = 0, choked = 1

Non-keepalive messages:
0 – choke
1 – unchoke
2 – interested
3 – not interested
4 – have
5 – bitfield
6 – request
7 – piece
8 – cancel


10/23/2007 P2P 111

Choking

By default, every peer is "choked"
– stop "uploading" to them, but the TCP connection is still there.

Select 4~6 peers to "unchoke" ??
– "Re-choke" every 30 seconds
– How to decide?

Optimistic Unchoking
– What is this?


10/23/2007 P2P 112

"Interested"

A request for a piece (or its sub-pieces)


10/23/2007 P2P 113

Get a piece/block!!

Download:
– Which peer? (download from whom? Does it matter?)
– Which piece?

How about "upload"?
– Which peer?
– Which piece?


10/23/2007 P2P 114

Piece Selection

Pipelining (5 requests)
Strict priority (incomplete pieces first)
Rarest first

What is the problem?


10/23/2007 P2P 115

Rarest First

Exchanging bitmaps with 20+ peers
– initial messages
– "have" messages

Array of buckets
– the i-th bucket contains "pieces" with i known instances
– within the same bucket, the client will randomly select one piece.
(A small selection sketch follows.)
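A minimal sketch of that bucketed, rarest-first choice; the data layout and names are invented for illustration and are not a real client's internals.

import random
from collections import defaultdict

def pick_rarest(peer_bitmaps, have, num_pieces):
    counts = defaultdict(list)                 # availability -> [piece indices]
    for p in range(num_pieces):
        if p in have:
            continue                           # already downloaded
        c = sum(1 for bm in peer_bitmaps if bm[p])
        if c > 0:
            counts[c].append(p)
    if not counts:
        return None
    rarest_bucket = counts[min(counts)]
    return random.choice(rarest_bucket)        # random tie-break within the bucket

bitmaps = [[1, 0, 1, 1], [1, 0, 0, 1], [1, 0, 0, 0]]
print(pick_rarest(bitmaps, have={0}, num_pieces=4))   # piece 2: only one peer has it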


10/23/2007 P2P 116

Piece Selection

Pipelining (5 requests)
Strict priority
3 stages:
– Random first piece
– Rarest first
– Endgame mode


10/23/2007 P2P 117

Piece Selection

Piece (64K~1M), sub-piece (16K)
– Piece size: a trade-off between performance and the size of the torrent file itself
– A client might request different sub-pieces of the same piece from different peers.

Strict priority – sub-pieces and pieces
Rarest first
– Exception: "random first"
– Get the stuff out of the seed(s) as soon as possible.


10/23/2007 P2P 118

Get a piece/block!!

Download:
– Which peer?
– Which piece?

How about "upload"?
– Which peer?
– Which piece?


10/23/2007 P2P 119

Peer Selection

Focus on rate
Upload to 4~6 peers
Random unchoke
Global rate cap only


10/23/2007 P2P 120

BitTorrent: "Tit for Tat"

Equivalent retaliation (game theory)
– A peer will "initially" cooperate, then respond in kind to the opponent's previous action. If the opponent previously was cooperative, the agent is cooperative. If not, the agent is not.


10/23/2007 P2P 121

Choking

By default, every peer is "choked"
– stop "uploading" to them, but the TCP connection is still there.

Select 4~6 peers to "unchoke" ??
– Best "upload rates" and "interested".
– Upload to the unchoked ones and monitor the download rate for all peers.
– "Re-choke" every 30 seconds.

Optimistic Unchoking (6+1)
– Randomly select a choked peer to unchoke.
(A sketch of one re-choke step follows.)
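A sketch of one possible re-choke decision, assuming we track for each peer whether it is interested and the download rate we currently observe from it; a real client's bookkeeping is more involved, and the names here are invented.

import random

def choose_unchoked(peers, regular_slots=4):
    # peers: list of dicts like {"id": ..., "interested": bool, "rate": observed bytes/s}
    interested = [p for p in peers if p["interested"]]
    best = sorted(interested, key=lambda p: p["rate"], reverse=True)[:regular_slots]
    rest = [p for p in peers if p not in best]
    optimistic = [random.choice(rest)] if rest else []   # the "+1" optimistic unchoke
    return best + optimistic    # everyone else stays choked until the next period

peers = [{"id": i, "interested": True, "rate": r} for i, r in enumerate([50, 10, 80, 5, 30])]
print([p["id"] for p in choose_unchoked(peers)])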


10/23/2007 P2P 122

BitTorrent

Fairness of download and upload between a pair of peers.
Every 10 seconds, estimate the download bandwidth from the other peer.
– Based on this performance estimate, decide whether to continue uploading to the other peer or not.


10/23/2007 P2P 123

Properties

Bigger "%" = better chance of being unchoked
Bigger "%" ~= better UL and DL rates ?!


10/23/2007 P2P 124

Peer/Seed

[Figure: a seed and peers exchanging pieces 1-4.]

Who to Unchoke?


10/23/2007 P2P 125

Seed unchoking

Old algorithm
– unchoke the fastest peers (how?)
– problem: fastest peers may monopolize seeds

New algorithm
– periodically sort all peers according to their last unchoke time
– prefer the most recently unchoked peers; on a tie, prefer the fastest
– (presumably) achieves an equal spread of seed bandwidth


10/23/2007 P2P 126

Seed unchoking

Old algorithm
– unchoke the fastest peers (how?)
– problem: fastest peers may monopolize seeds

New algorithm
– periodically sort all peers according to their last unchoke time
– prefer the most recently unchoked peers; on a tie, prefer the fastest
– (presumably) achieves an equal spread of seed bandwidth


10/23/2007 P2P 127

Attacks to BT

???


10/23/2007 P2P 128

Attacks to BT

Download only from the seeds
Download only from the fastest peers
Announcing false pieces
Privacy -- (torrent, source IP addresses)


10/23/2007 P2P 129

BitTorrent: Questions to ask

Peer's role (or SP's role)
Peer's controllability and vulnerability
Incentives to contribute
Peer's mobility and dynamics
Scalability


10/23/2007 P2P 130

BitTorrent

"Tit-for-Tat" incentive model within the same torrent
Piece/peer selection and choking
The need for a tracker and torrent file


10/23/2007 P2P 131

Client implementationsClient implementations mainline: written in Python; right now, the only

one employing the new seed unchoking algorithm Azureus: the most popular, written in Java;

implements a special protocol between clients(e.g. peers can exchange peer lists)

other popular clients: ABC, BitComet, BitLord, BitTornado, μTorrent, Opera browser

various non-standard extensions– retaliation mode: detect compromised/malicious peers– anti-snubbing: ignore a peer who ignores us– super seeding: seed masquerading as a leecher


10/23/2007 P2P 132

Resources

Basic BitTorrent mechanisms
– [Cohen, P2PECON'03]
– BitTorrent specification wiki: http://wiki.theory.org/BitTorrentSpecification

Measurement studies
– [Izal et al., PAM'04], [Pouwelse et al., Delft TR 2004 and IPTPS'05], [Guo et al., IMC'05], and [Legout et al., INRIA-TR-2006]

Theoretical analysis and modeling
– [Qiu et al., SIGCOMM'04] and [Tian et al., Infocom'06]

Simulations
– [Bharambe et al., MSR-TR-2005]

Sharing incentives and exploiting them
– [Shneidman et al., PINS'04], [Jun et al., P2PECON'05], and [Liogkas et al., IPTPS'06]