Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure
Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, and Antony L. T. Rowstron
IEEE Journal on Selected Areas in Communications, Oct. 2002
Outline
Pastry: a peer-to-peer location and routing substrate
Scribe: built on top of Pastry
Experimental evaluation: delay penalty, node stress (routing tables), link stress (network bandwidth)
Pastry (1/2)
Each Pastry node has a unique 128-bit nodeId.
The set of existing nodeIds is uniformly distributed. This is achieved by basing the nodeId on a secure hash of the node’s public key or IP address.
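As a minimal sketch of how a nodeId could be derived, the snippet below hashes an identity string and keeps the top 128 bits. The choice of SHA-1, the truncation, and the function name `node_id` are assumptions made for illustration; the slides only say "a secure hash".

```python
import hashlib


def node_id(identity: bytes, bits: int = 128) -> int:
    """Derive a nodeId from a public key or IP address (hypothetical helper).

    Assumption: SHA-1 is used and the digest is truncated to `bits` bits.
    """
    digest = hashlib.sha1(identity).digest()      # 160-bit digest
    as_int = int.from_bytes(digest, "big")
    return as_int >> (len(digest) * 8 - bits)     # keep the top `bits` bits


if __name__ == "__main__":
    # 192.0.2.1 is a documentation-only address, used here as sample input.
    print(f"{node_id(b'192.0.2.1'):032x}")
```

Because the hash output is effectively uniform, nodeIds produced this way spread evenly over the 128-bit space regardless of how the input addresses are clustered.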
Pastry (2/2)
Each node maintains:
  A routing table covering some of the live nodes. Each entry maps a nodeId to the associated node’s IP address.
  The IP addresses of the nodes in its “leaf set”.
Leaf set (l nodes in total): the l/2 nodes with the numerically closest larger nodeIds and the l/2 nodes with the numerically closest smaller nodeIds.
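The leaf set definition can be sketched in a few lines. This is an illustration only: it assumes global knowledge of the live nodeIds, small integer ids, a circular id space (so the neighbors of the smallest id wrap around to the largest), and more than l other live nodes. The function name `leaf_set` is hypothetical.

```python
def leaf_set(node_id, live_ids, l=4):
    """Return (smaller, larger): the l/2 closest ids on each side of node_id.

    Assumption: the id space wraps around, and there are more than l other nodes.
    """
    ring = sorted(set(live_ids) | {node_id})
    i = ring.index(node_id)
    n = len(ring)
    half = l // 2
    larger = [ring[(i + k) % n] for k in range(1, half + 1)]
    smaller = [ring[(i - k) % n] for k in range(1, half + 1)]
    return smaller, larger


if __name__ == "__main__":
    print(leaf_set(30, [10, 20, 30, 40, 50, 60]))
```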
Routing
Given a message and a key, Pastry reliably routes the message to the node whose nodeId is numerically closest to the key among all live nodes.
In each routing step, the current node normally forwards the message to a node whose nodeId shares a longer prefix with the key than its own nodeId does.
The key need not equal the destination nodeId.
[Figure: routing a message from node 65a1fc with key d46a1c]
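The routing rule can be sketched as a small simulation. It is a simplification under stated assumptions: every node is assumed to know every other node (real Pastry nodes only consult their routing table and leaf set), nodeIds are hex strings of equal length, and the intermediate nodeIds in the usage example are illustrative. The helper names are hypothetical.

```python
from typing import List, Optional


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading hex digits that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(current: str, key: str, known: List[str]) -> Optional[str]:
    """Prefer a node with a longer prefix match; otherwise a numerically closer one."""
    here = shared_prefix_len(current, key)
    longer = [n for n in known if shared_prefix_len(n, key) > here]
    if longer:
        # Take the smallest improvement, mimicking progress of one digit per hop.
        return min(longer, key=lambda n: shared_prefix_len(n, key))

    def distance(n: str) -> int:
        return abs(int(n, 16) - int(key, 16))

    closer = [n for n in known
              if shared_prefix_len(n, key) >= here and distance(n) < distance(current)]
    return min(closer, key=distance) if closer else None


def route(start: str, key: str, nodes: List[str]) -> List[str]:
    """Follow next_hop until no better node exists; return the path taken."""
    path = [start]
    current = start
    while True:
        hop = next_hop(current, key, [n for n in nodes if n != current])
        if hop is None:
            return path
        path.append(hop)
        current = hop


if __name__ == "__main__":
    nodes = ["65a1fc", "d13da3", "d4213f", "d462ba", "d467c4", "d471f1"]
    print(route("65a1fc", "d46a1c", nodes))
```

With these sample ids the message gains one matching digit per hop and ends at d467c4, the id numerically closest to the key, even though no node has the key itself as its nodeId.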
Locality properties
Short routes property: concerns the total distance that messages travel along Pastry routes. In each step, a message is routed to the nearest node with a longer prefix match.
Route convergence property: concerns the distance traveled by two messages sent to the same key before their routes converge.
[Figure: route convergence, with nodes labeled A, B, C, D, and E and the point where the routes converge]
Node addition
A new node with nodeId X initializes its state by contacting a nearby node A.
A routes a special message using X as the key. This message is routed to the existing node Z whose nodeId is numerically closest to X.
X then obtains the leaf set and the routing table from Z.
Z is the node numerically closest to X, so their leaf sets are almost the same. Their routing tables are very similar.
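The join step can be sketched roughly as follows. This is a simplification, not the Pastry join protocol itself: it assumes global knowledge of the live ids, uses small integers as nodeIds, ignores wraparound and routing table transfer, and simply seeds X’s leaf set with the l ids closest to X taken from Z’s leaf set plus Z. All names are hypothetical.

```python
def closest_node(key, live_ids):
    """The live id numerically closest to key (ties go to the smaller id)."""
    return min(live_ids, key=lambda n: (abs(n - key), n))


def join(x, live_ids, leaf_sets, l=4):
    """Seed the leaf set of new node x from node Z, the node closest to x.

    Assumption: leaf_sets maps each existing id to a list of ids in its leaf set.
    """
    z = closest_node(x, live_ids)
    candidates = (set(leaf_sets[z]) | {z}) - {x}
    nearest = sorted(candidates, key=lambda n: (abs(n - x), n))[:l]
    leaf_sets[x] = sorted(nearest)
    return z


if __name__ == "__main__":
    leaf_sets = {30: [10, 20, 40, 50]}
    z = join(33, [10, 20, 30, 40, 50], leaf_sets)
    print(z, leaf_sets[33])
```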
Failure
To handle node failures, neighboring nodes in the nodeId space periodically exchange keep-alive messages.
If a node is unresponsive for a period T, it is presumed failed.
All members of the failed node’s leaf set are then notified, and they update their leaf sets.
Routing table entries that refer to the failed node are repaired lazily.
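The keep-alive rule can be sketched as a small bookkeeping class. The class and method names are hypothetical, and timestamps are passed in explicitly (in arbitrary time units) so the example is deterministic; a real node would read a clock.

```python
class FailureDetector:
    """Track when each neighbor was last heard from (illustrative sketch)."""

    def __init__(self, timeout):
        self.timeout = timeout      # the period T from the slide
        self.last_seen = {}

    def heard_from(self, node, now):
        """Record a keep-alive received from `node` at time `now`."""
        self.last_seen[node] = now

    def presumed_failed(self, now):
        """Neighbors that have been silent for longer than T."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)


if __name__ == "__main__":
    fd = FailureDetector(timeout=30)
    fd.heard_from("a", now=0)
    fd.heard_from("b", now=25)
    print(fd.presumed_failed(now=40))
```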
Scribe
Scribe uses Pastry to manage group creation and group joining, and to build a per-group multicast tree.
Implementation: CREATE, JOIN, MULTICAST, LEAVE
Multicast tree creation
[Figure: multicast tree creation. Node labels: 1100 (CREATE), 0111 (JOIN), 0100 (JOIN), 1001 (forwarder), 1101 (forwarder), 1111 (forwarder).]
b = 1 (match 1 bit at a time)
Because b = 1, either 1111 or 1101 can act as a forwarder.
Membership
Rendezvous point: the root of the multicast tree. It can be changed.
Forwarder: a Scribe node that is part of a group’s multicast tree. Forwarders may or may not be members of the group. Each forwarder maintains a children table.
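How JOIN messages build the children tables can be sketched as below. The sketch assumes the Pastry route of each JOIN is already known and passed in as a list of nodeIds; the two routes in the usage example are hypothetical (with b = 1 the first JOIN could equally pass through 1111 instead of 1101). The function name is also an assumption.

```python
def handle_join(children, in_tree, route):
    """Process one JOIN along `route` (joining node first, root last).

    children: dict mapping a forwarder to the set of its children.
    in_tree:  set of nodes already part of the multicast tree.
    """
    for child, parent in zip(route, route[1:]):
        children.setdefault(parent, set()).add(child)
        in_tree.add(child)
        if parent in in_tree:
            return  # the JOIN stops at the first node already in the tree


if __name__ == "__main__":
    children, in_tree = {}, {"1100"}          # 1100 is the rendezvous point
    handle_join(children, in_tree, ["0111", "1001", "1101", "1100"])
    handle_join(children, in_tree, ["0100", "1001", "1101", "1100"])
    print(children)
```

The second JOIN stops at 1001 because that node is already a forwarder, which is why the root never sees most of the join traffic.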
Multicast message dissemination
Multicast sources use Pastry to locate the rendezvous point of a group. They route to the rendezvous point and ask it to return its IP address.
They cache the rendezvous point’s IP address and use it in subsequent multicasts to the group.
Multicast messages are disseminated from the rendezvous point along the multicast tree.
Why disseminate from the rendezvous point? Each multicast source could also be viewed as the root. If each multicast source transmitted data by itself, the worst-case delay penalty could double.
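Dissemination from the rendezvous point amounts to walking the children tables from the root downward. The sketch below does this breadth-first over an in-memory dictionary; the tree, the function name, and the sorted visiting order are assumptions for illustration, and the tree is assumed to be acyclic.

```python
from collections import deque


def disseminate(children, root):
    """Return the order in which nodes receive a message sent from `root`."""
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        # Each forwarder passes the message on to every entry in its children table.
        queue.extend(sorted(children.get(node, ())))
    return order


if __name__ == "__main__":
    tree = {"1100": {"1101"}, "1101": {"1001"}, "1001": {"0111", "0100"}}
    print(disseminate(tree, "1100"))
```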
Reliability
Each nonleaf node in the tree sends a heartbeat message to its children.
A child suspects that its parent is faulty when it fails to receive heartbeat messages.
Upon detecting the failure of its parent, a node calls Pastry to route a JOIN message to a new parent.
If the failed node is the root, a new root (the live node with the nodeId numerically closest to the groupId) replaces it.
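The repair step can be sketched as follows, assuming small integer nodeIds and global knowledge of which nodes are alive. The function returns the orphaned children, which would then each route a JOIN, and the node that should act as root. The name `repair_after_failure` and the data layout are hypothetical.

```python
def repair_after_failure(children, failed, group_id, live_ids):
    """Remove `failed` from the tree; return (orphans, root).

    orphans: children of the failed node, which must rejoin via Pastry.
    root:    the live node whose id is numerically closest to group_id.
    """
    orphans = sorted(children.pop(failed, set()))
    for kids in children.values():
        kids.discard(failed)
    alive = [n for n in live_ids if n != failed]
    root = min(alive, key=lambda n: (abs(n - group_id), n))
    return orphans, root


if __name__ == "__main__":
    tree = {12: {13}, 13: {9}, 9: {7, 4}}
    print(repair_after_failure(tree, failed=12, group_id=12,
                               live_ids=[12, 13, 9, 7, 4, 15]))
```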
Experimental evaluation
Comparison with IP multicast: delay penalty, node stress, link stress
Experimental setup: a network topology with 5,050 routers; Scribe runs on 100,000 end nodes; 1,500 groups
Delay penalty
Scribe increases the delay to deliver messages relative to IP multicast.
RMD: the ratio between the maximum delay using Scribe and the maximum delay using IP multicast.
RAD: the ratio between the average delay using Scribe and the average delay using IP multicast.
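The two ratios are straightforward to compute for one group once per-member delivery delays are available. The delay values below are made-up sample numbers, not results from the evaluation, and the function name is hypothetical.

```python
def delay_penalty(scribe_delays, ip_delays):
    """Return (RMD, RAD) for one group, given per-member delivery delays."""
    rmd = max(scribe_delays) / max(ip_delays)
    rad = (sum(scribe_delays) / len(scribe_delays)) / (sum(ip_delays) / len(ip_delays))
    return rmd, rad


if __name__ == "__main__":
    # Hypothetical delays in milliseconds for a three-member group.
    print(delay_penalty([30, 60, 90], [20, 30, 40]))
```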
Delay penalty
[Figure: delay penalty of Scribe relative to IP multicast, shown as the number of groups with a RAD or RMD lower than or equal to a given relative delay.]
Node stress (1/2)
Node stress (2/2)
On average, each node remembers only a few children. The distribution has a long tail.
Link stress
IP multicast: 950
Scribe: 4031
Bottleneck remover (1/3)
Reasons:
  Some nodes may have less computational power or bandwidth available than others.
  The distribution of children table entries has a long tail.
Algorithm:
  When a node is overloaded, it selects the group that consumes the most resources.
  It chooses the child in this group that is farthest away.
Bottleneck remover (2/3)
The parent drops the chosen child by sending it a message containing the children table for the group.
When the child receives the message:
  It measures the delay between itself and the other nodes in the table.
  It computes the total delay between itself and the parent via each node in the table.
  It sends a JOIN message to the node that provides the smallest combined delay.
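Both halves of the bottleneck remover can be sketched with plain dictionaries. The per-group load numbers, the delay values, and the function names are assumptions for illustration; the sketch takes "farthest away" to mean the largest measured delay from the parent.

```python
def pick_child_to_drop(group_load, children, delay_to_child):
    """Parent side: pick the costliest group, then its farthest child."""
    group = max(group_load, key=group_load.get)
    child = max(children[group], key=lambda c: delay_to_child[c])
    return group, child


def pick_new_parent(delay_from_child, delay_to_parent):
    """Child side: choose the node minimizing child->node plus node->parent delay.

    Both dicts are keyed by the other nodes in the received children table.
    """
    return min(delay_from_child,
               key=lambda n: delay_from_child[n] + delay_to_parent[n])


if __name__ == "__main__":
    load = {"g1": 5, "g2": 9}
    kids = {"g1": ["a"], "g2": ["b", "c", "d"]}
    print(pick_child_to_drop(load, kids, {"a": 3, "b": 10, "c": 40, "d": 25}))
    print(pick_new_parent({"b": 12, "d": 8}, {"b": 10, "d": 25}))
```

In the sample data, child c is dropped from group g2 and reattaches under b, since 12 + 10 is smaller than 8 + 25.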
Bottleneck remover (3/3)
[Figure: node stress with the bottleneck remover. No long tail.]
Scalability
Evaluating Scribe’s scalability with a large number of groups.
Experimental setup: 50,000 Scribe nodes; 30,000 groups with 11 members each
Node stress (1/2)
Collapse will be introduced later.
Node stress (2/2)
Scribe is inappropriate for small groups! The distribution has a long tail.
Scribe collapse (1/2)
If a multicast group has few members, the group may require many other nodes to become forwarders. (The tree is inefficient.)
The new algorithm collapses long paths in the tree by removing nodes that are not members of the group and have only one entry in the group’s children table.
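The collapse rule can be sketched as a tree rewrite: any non-member forwarder with exactly one child is spliced out, and its parent adopts that child. The sketch assumes the whole tree is available in one dictionary and that the root is never removed; both are assumptions for illustration, as are the names.

```python
def collapse(children, members, root):
    """Splice out non-member forwarders that have exactly one child."""
    parent = {c: p for p, kids in children.items() for c in kids}
    changed = True
    while changed:
        changed = False
        for node in list(children):
            kids = children.get(node)
            if node == root or node in members or kids is None or len(kids) != 1:
                continue
            (only_child,) = kids
            up = parent[node]
            children[up].remove(node)       # the parent forgets the forwarder
            children[up].add(only_child)    # and adopts its single child
            parent[only_child] = up
            del children[node]
            del parent[node]
            changed = True
    return children


if __name__ == "__main__":
    tree = {"r": {"f1"}, "f1": {"f2"}, "f2": {"m1", "m2"}}
    print(collapse(tree, members={"m1", "m2"}, root="r"))
```

Forwarder f2 survives because it has two children; only f1, which merely relays to a single child, is removed.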
Scribe collapse (2/2)
[Figure: link stress for naïve unicast, Scribe, IP multicast, and Scribe collapse.]