T-110.5140 Network Application Frameworks and XML
Distributed Hash Tables (DHTs)
1.2.2010
Tancred Lindholm, Sasu Tarkoma
2
Contents
- Motivation
- Terminology
- Distributed Hash Tables (DHTs)
- Overlays
- Clusters and wide-area
- Non-filesharing applications for DHTs
  - Internet Indirection Infrastructure (i3)
  - Delegation Oriented Architecture (DOA)
3
DHT Motivation
- Challenges in the wide-area
  - Scalability: increasing number of users, requests, files, traffic
  - Resilience: more components -> more failures
  - Management: intermittent resource availability -> complex management
- DHTs promise to address these
4
DHT Motivation (cont.)
- Directories are needed
  - Name resolution & lookup
  - Mobility support with fast updates
- Required properties
  - Fast updates
  - Scalability
  - Reliability
- DNS has limitations
  - Update latency
  - Administration
- Alternative: DynDNS
  - Still centralized management
  - Inflexible (reverse DNS?)
5
Terminology
- Peer-to-peer (P2P)
  - Different from the client-server model
  - Each peer has both client and server features
- Distributed Hash Tables (DHT)
  - An algorithm for creating efficient distributed hash tables (lookup structures)
  - Used to implement overlay networks
- Overlay networks
  - Routing systems that run on top of another network, such as the Internet
- Typical features of P2P / overlays
  - Scalability, resilience, high availability, and tolerance of frequent peer connections and disconnections
6
Peer-to-peer in more detail
- A P2P system is distributed
  - No centralized control
  - Nodes are symmetric in functionality
  - A comparatively large fraction of nodes is unreliable
  - Nodes come and go
- P2P enabled by evolution in data communications and algorithms (DHTs)
- P2P driven mostly by illegal file sharing?
- Current challenges:
  - Security (zombie networks, trojans), IPR issues
- P2P systems are decentralized overlays
7
Evolution of P2P systems
- Started from centralized servers
  - Napster: centralized directory, P2P transfer
  - The central directory is a single point of failure
- Second generation used flooding
  - Gnutella: a local directory at each peer
  - Search by flooding the network with queries
  - High cost, worst-case O(N) messages
- Research systems use DHTs
  - Chord, Tapestry, CAN, ...
  - Decentralization, scalability
- Data lookup by fixed key in file-sharing apps
  - e.g. the BitTorrent DHT extension
  - not suited for free-form queries
8
DHT interfaces
- DHTs typically offer two functions
  - put(key, value)
  - get(key) --> value
  - optionally delete(key)
- Supports a wide range of applications
  - Similar interface to UDP/IP
    - Send(IP address, data)
    - Receive(IP address) --> data
- No restrictions are imposed on the semantics of values and keys
  - Commonly key of value = hash(value)
- Key/value pairs are persistent and global
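As a rough illustration of the interface above, here is a minimal Python sketch; the ToyDHT class and content_key helper are illustrative stand-ins, not any particular DHT implementation, and the storage is a local dict rather than a network of nodes:

import hashlib

class ToyDHT:
    """Toy, single-process stand-in for a DHT: the same put/get interface,
    but backed by a local dict instead of a network of nodes."""
    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)

def content_key(value: bytes) -> str:
    # Common convention: the key of a value is the hash of the value
    return hashlib.sha1(value).hexdigest()

dht = ToyDHT()
data = b"hello overlay"
dht.put(content_key(data), data)
assert dht.get(content_key(data)) == data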
9
[Figure: distributed applications call put(key, value) and get(key) -> value on a Distributed Hash Table (DHT); the DHT balances keys and data across the underlying nodes.]
10
Some DHT applications
- Storage-related
  - File sharing
  - Web caching
  - Censor-resistant data storage
  - Backup storage
  - Web archive
- Event notification
- Naming systems
- Query and indexing
- Communication primitives
(Examples in this lecture)
11
DHT Example 1: Chord
- Chord is an overlay algorithm from MIT
  - Stoica et al., SIGCOMM 2001
- Chord is a lookup structure (a directory)
  - The idea resembles binary search
- Support for rapid joins and leaves
  - Churn
- Maintains routing tables
12
Chord
- Uses consistent hashing to map keys to nodes
  - Consistent hashing means every node computes the same node for a given key (compare to using distributed protocols to agree on an id)
- Keys are hashed to m-bit identifiers
- Nodes have m-bit identifiers
  - The node's IP address is hashed
- SHA-1 is used as the baseline hash algorithm
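A minimal sketch of this hashing step (assuming SHA-1 output reduced modulo 2^m; the helper name and inputs are illustrative):

import hashlib

M = 6  # identifier bits, matching the ring example on the next slides

def chord_id(data: bytes, m: int = M) -> int:
    """Hash a key or an IP address to an m-bit Chord identifier."""
    digest = hashlib.sha1(data).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

node_id = chord_id(b"130.233.0.1")    # node identifier from its IP address
key_id = chord_id(b"some-file-name")  # key identifier from the key itself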
13
Chord routing I
- A node has a well-determined place within the ring
  - Identifiers are ordered on an identifier circle modulo 2^m ("clock math", e.g. 5 + 7 mod 2^3 = 4)
  - The Chord ring with m-bit identifiers
- A node has a predecessor and a successor
  - A node stores the keys between its predecessor (inclusive) and itself (exclusive)
  - The (key, value) is stored on the successor node of the key
- A routing table (finger table) keeps track of other nodes
- NOTE: descriptions in the literature differ slightly (range endpoint behavior), but the idea is the same
14
Finger Table
- Each node maintains a routing table with at most m entries
- Each node also knows its predecessor
- The i:th entry of the table at node n contains the identity of the first node, s, that succeeds n by at least 2^(i-1) on the identifier circle
  - s = successor(n + 2^(i-1))
  - s is called the i:th finger of node n
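A small sketch of computing these finger targets for the example ring on the next slide (the node list and the successor helper are illustrative, with global knowledge of the ring instead of a distributed protocol):

M = 6
NODES = [1, 8, 14, 21, 32, 38, 42, 51, 56]  # node identifiers on the example ring

def successor(ident: int) -> int:
    """First node whose identifier is >= ident (wrapping around the ring)."""
    ident %= 2 ** M
    for n in sorted(NODES):
        if n >= ident:
            return n
    return min(NODES)  # wrap past 2^M - 1 back to the smallest identifier

def finger_table(n: int):
    """Finger i of node n points to successor(n + 2^(i-1)), i = 1..m."""
    return [successor(n + 2 ** (i - 1)) for i in range(1, M + 1)]

print(finger_table(8))  # -> [14, 14, 14, 21, 32, 42], as in the figure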
15
[Figure: Chord ring with m = 6 and nodes N1, N8, N14, N21, N32, N38, N42, N51, N56 on the identifier circle 0..2^m - 1. For j = 1, ..., m the fingers of node p point to the successors of p + 2^(j-1). The finger table of N8 (x = 8):

Finger    Maps to          Real node
1, 2, 3   x+1, x+2, x+4    N14
4         x+8              N21
5         x+16             N32
6         x+32             N42

The predecessor node is also marked. E.g. N32 is responsible for keys 21 to 31.]
16
Chord routing II
- Routing steps
  - check whether the key k is found between n and the successor of n → successor(n)
  - if not, forward the request to the closest finger preceding k
- Each node knows a lot about nearby nodes and less about nodes farther away
- The target node is eventually found
  - Find the right half-circle
  - Then the right quadrant
  - Then the next 1/8th
  - etc.
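Continuing the ring sketch above (reusing M, NODES, successor and finger_table), a rough version of this lookup loop might look as follows; the between helper and the (predecessor, node] endpoint convention are illustrative, and as the slides note, endpoint conventions vary slightly between descriptions:

def between(x: int, a: int, b: int) -> bool:
    """True if x lies strictly inside the ring interval (a, b) modulo 2^M."""
    x, a, b = x % 2 ** M, a % 2 ** M, b % 2 ** M
    return a < x < b if a < b else (x > a or x < b)

def lookup(start: int, key: int) -> int:
    """Walk the ring from node `start` until the node responsible for key is found."""
    n = start
    while True:
        succ = successor((n + 1) % 2 ** M)       # n's own successor on the ring
        if key % 2 ** M == succ or between(key, n, succ):
            return succ                          # key falls in (n, successor(n)]
        # otherwise forward to the closest finger preceding the key
        preceding = [f for f in finger_table(n) if between(f, n, key)]
        n = preceding[-1] if preceding else succ

print(lookup(8, 50))  # -> 51, matching the get(50) example on the next slide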
17
Chord lookup
[Figure: Chord lookup example with m = 6: get(50) issued on the ring N1, N8, N14, N21, N32, N38, N42, N51, N56. The query is forwarded along fingers (hops of +16 and +8 are shown) until it reaches N51, which is responsible for keys 42..50.]
18
Invariants
- Two invariants:
  - Each node's successor is correctly maintained.
  - For every key k, node successor(k) is responsible for k.
- A node stores the keys between its predecessor and itself
  - The (key, value) is stored on the successor node of the key
19
Join
- A new node n joins
  - It needs to know an existing node n'
- Three steps
  1. Initialize the predecessor and fingers of node n
  2. Update the fingers and predecessors of existing nodes to reflect the addition of n
  3. Notify the higher layer
- Leave uses steps 2 (update for the removal) and 3 (relocate keys)
20
Chord Join Overview
Node 26 joins the system between nodes 21 and 32. The arcs represent the successor relationship.
(a) Initial state: node 21 points to node 32; (b) node 26 finds its successor (i.e., node 32) and points to it; (c) node 26 copies all keys less than 26 from node 32;
(d) the stabilize procedure updates the successor of node 21 to node 26.
Source: Stoica et al., Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
21
Chord Properties
- Each node is on average responsible for K/N keys (K is the number of keys, N is the number of nodes)
- When a node joins or leaves the network, only O(K/N) keys are relocated
- Lookups take O(log N) messages
- To re-establish the routing invariants after a join/leave, O(log^2 N) messages are needed
22
DHT Example 2: Tapestry
- A DHT developed at UCB
  - Zhao et al., UC Berkeley TR 2001
- Used in OceanStore
  - A secure, wide-area storage service
- Tree-like geometry
  - Suffix-based hypercube
  - 160-bit identifiers
- Suffix routing from A to B
  - hop h shares a suffix of length h digits with B
- Tapestry core API:
  - publishObject(ObjectID, [serverID])
  - routeMsgToObject(ObjectID)
  - routeMsgToNode(NodeID)
23
Suffix routing
0312 routes to 1643 via
0312 -> 2173 -> 3243 -> 2643 -> 1643
Hop 1: shares a suffix of length 1 with 1643
Hop 2: shares a suffix of length 2 with 1643
Hop 3: shares a suffix of length 3 with 1643
Routing table with b * log_b(N) entries
Entry (i, j) - pointer to the neighbour ...ji (i = suffix)
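A rough sketch of the suffix-matching rule behind this example (illustrative only; it checks the example path rather than building a Tapestry routing table):

def suffix_match_len(a: str, b: str) -> int:
    """Number of trailing digits that identifiers a and b have in common."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

# Each hop shares at least one more trailing digit with the destination 1643:
path = ["0312", "2173", "3243", "2643", "1643"]
for hop, node in enumerate(path[1:], start=1):
    assert suffix_match_len(node, "1643") >= hop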
24
Pastry I
- A DHT based on a circular, flat identifier space like Chord
- Prefix routing
  - A message is sent towards a node which is numerically closest to the target node
  - The procedure is repeated until the node is found
  - Prefix match: the number of identical digits before the first differing digit
  - The prefix match increases or the numerical distance decreases at every hop
- Customizable with different distance metrics
- Similar performance to Chord
25
Pastry II
- Routing a message (a rough sketch of this decision follows below)
  - If the list of nearby nodes (the "leaf set") covers the key --> send to that node directly
  - else send to the node in the routing table with the longest common prefix (longer than the current node's)
  - else query the leaf set for a numerically closer node with the same prefix match as the current node
- Used in event notification (Scribe)
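A very simplified sketch of this three-way decision, assuming identifiers are fixed-length digit strings; the leaf set and routing table are plain lists here, not Pastry's actual per-row structures:

def prefix_match_len(a: str, b: str) -> int:
    """Number of identical leading digits before the first differing digit."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def next_hop(me: str, key: str, leaf_set: list, routing_table: list) -> str:
    """Pick the next hop for `key` at node `me` (simplified Pastry-style rule)."""
    # 1. Key falls within the leaf set: deliver to the numerically closest node.
    if leaf_set and min(leaf_set) <= key <= max(leaf_set):
        return min(leaf_set + [me], key=lambda n: abs(int(n) - int(key)))
    # 2. Otherwise prefer a routing-table entry sharing a longer prefix with the key.
    p = prefix_match_len(me, key)
    better = [n for n in routing_table if prefix_match_len(n, key) > p]
    if better:
        return max(better, key=lambda n: prefix_match_len(n, key))
    # 3. Fall back to any known node with an equal prefix match but numerically
    #    closer to the key than the current node.
    same = [n for n in leaf_set + routing_table
            if prefix_match_len(n, key) >= p
            and abs(int(n) - int(key)) < abs(int(me) - int(key))]
    return min(same, key=lambda n: abs(int(n) - int(key))) if same else me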
26
Security Considerations
- Malicious nodes
  - An attacker floods the DHT with data
  - An attacker returns incorrect data
    - Defence: self-authenticating data (the hash of the value is the key)
  - An attacker denies that data exists or supplies incorrect routing info
- Basic solution: redundancy
  - k-redundant networks
- What if attackers have a quorum, i.e. nodes strategically placed to defeat the redundancy?
  - Need a way to control the creation of node IDs
  - Solution: secure node identifiers
    - Use public keys
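The self-authenticating-data idea is easy to check in code: with the key = hash(value) convention, a client can reject a value that does not match the key it asked for (a minimal sketch, reusing the SHA-1 convention from earlier):

import hashlib

def verify(key: str, value: bytes) -> bool:
    """Accept a value only if its hash matches the key it was stored under."""
    return hashlib.sha1(value).hexdigest() == key

good = b"original content"
key = hashlib.sha1(good).hexdigest()
assert verify(key, good)
assert not verify(key, b"tampered content")  # a malicious node is detected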
27
Overlay Networks
- Origin in peer-to-peer (P2P)
- Builds upon Distributed Hash Tables (DHTs)
- Easy to deploy
  - No changes to routers or the TCP/IP stack
  - Typically on the application layer
- Overlay properties
  - Resilience
  - Fault-tolerance
  - Scalability
28
Overlay Networks (cont.)
[Figure: the overlay network's logical topology (e.g. the Chord ring) mapped onto the nodes' "real" topology in the IP network.]
29
[Figure: layer diagram showing where the overlay sits: upper layers (apps, middleware), overlay, congestion control, end-to-end, routing, TCP.]
30
Distributed Data Structures (DSS)
- DHTs are an example of DSS
- DHT algorithms are available for cluster and wide-area environments
  - They are different!
- Cluster-based solutions
  - Ninja
  - LH* and variants
- Wide-area solutions
  - Chord, Tapestry, ...
  - Flat DHTs, peers are equal
  - Each peer maintains a subset of peers in a routing table
31
Cluster vs. Wide-area
- Clusters are
  - single, secure, controlled administrative domains
  - engineered to avoid network partitions
  - built on low-latency, high-throughput SANs
  - predictable behaviour, controlled environment
- Wide-area
  - Unstable configuration: nodes join & leave
  - Heterogeneous networks
  - Unpredictable delays and packet drops
  - Multiple administrative domains
  - Network partitions possible
32
Wide-area requirements
- Easy deployment
- Scalability to millions of nodes and billions of data items
- Availability
  - Copes with routine faults
- Self-configuring, adaptive to network changes
- Takes locality into account
33
Distributed Data Structures (DSS) for Clusters
- Ninja project (UCB)
  - A new storage layer for cluster services
  - Partition a conventional data structure across the nodes in a cluster
  - Replicate partitions with replica groups in the cluster
    - Availability
  - Sync replicas to disk (durability)
- Other DSS for data / clusters
  - Amazon Dynamo storage
    - Dynamo: Amazon's Highly Available Key-Value Store (SOSP 2007)
  - LH*: Linear Hashing for Distributed Files
    - Redundant versions for high availability
34
Cluster-based Distributed Hash Tables (DHT) in Ninja
- A directory for non-hierarchical data
- Several different ways to implement
- A distributed hash table
  - Consists of "bricks", each of which maintains a partial map
  - Adding bricks increases capacity / performance
- Resilience through parallel, unrelated mappings
35
Ninja Architecture
[Figure: clients interact with any service "front-end"; each service interacts with the DSS library, which offers a hash table API; storage "bricks" (single-node, durable, replicated hash tables) are connected to the services over a redundant, low-latency, high-throughput SAN.]
36
Applications for DHTs
- DHTs are used as a basic building block for an application-level infrastructure
- Internet Indirection Infrastructure (i3)
  - A new forwarding infrastructure based on Chord
- DOA (Delegation Oriented Architecture)
  - A new naming and addressing infrastructure based on overlays
37
Internet Indirection Infrastructure (i3)
- A DHT-based overlay network
  - Based on Chord
- Aims to provide a more flexible communication model than current IP addressing
- Decouples the sender from the receiver by introducing an indirection point
- One proposal for fixing some fundamental problems in the Internet
38
i3 II
- i3 packets are sent to identifiers
- each identifier is routed to the i3 node responsible for that identifier
- the node maintains triggers that are installed by receivers
- when a matching trigger is found, the packet is forwarded to the receiver
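A toy sketch of the trigger mechanism at a single i3 node; all names and the in-process "forwarding" are purely illustrative. The last two lines mirror the mobility example a few slides below:

triggers = {}  # identifier -> receiver address, installed by receivers

def insert_trigger(ident: str, receiver: str) -> None:
    """A receiver installs (id, addr) and will get packets sent to id."""
    triggers[ident] = receiver

def send(ident: str, packet: bytes) -> None:
    """Senders address packets to identifiers, not to hosts."""
    receiver = triggers.get(ident)
    if receiver is not None:
        print(f"forwarding {len(packet)} bytes to {receiver}")

insert_trigger("id", "R1")
send("id", b"hello")          # forwarded to R1
insert_trigger("id", "R2")    # the receiver moved and updated its trigger
send("id", b"hello again")    # same identifier, now forwarded to R2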
39
i3 III
- An i3 identifier may be bound to a host, an object, or a session
- i3 has been extended to allow end hosts to control the placement of rendezvous points (indirection points) for efficient routing and handovers
  - This extension is the Robust Overlay Architecture for Mobility (ROAM)
- Legacy application support
  - a user-level proxy encapsulates IP packets into i3 packets
40
Source: http://i3.cs.berkeley.edu/
R inserts a trigger (id, R) and receives all packets with identifier id.
When the host changes its address from R1 to R2, it updates its trigger from (id, R1) to (id, R2).
Mobility is transparent to the sender.
41
Source: http://i3.cs.berkeley.edu/
A multicast tree using a hierarchy of triggers
42
Source: http://i3.cs.berkeley.edu/
Anycast using the longest matching prefix rule. Suffixes indirectly give preference to receivers in the group.
43
Source: http://i3.cs.berkeley.edu/
Sender-driven service composition using a stack of identifiers.
Receiver-driven service composition using a stack of identifiers.
44
Layered Naming Architecture
- Presented in the paper:
  - A Layered Naming Architecture for the Internet, Balakrishnan et al., SIGCOMM 2004
- What is wrong with http://images.org/tux.jpg ?
  1. OK: it names an image
  2. BAD: it tells us HOW and WHERE to reach it
  3. BAD: WHERE (DNS -> IP) is bound to the Internet topology
- Service Identifiers (SIDs) are host-independent data or service names (fix point 2)
- End-point Identifiers (EIDs) are location-independent host names (fix point 3)
- E.g. (sketched in code below)
  1. Resolve URL to SID = persistent name of the data
  2. Resolve SID to transport and EID = how and from whom to get the data
  3. Resolve EID to IP = address of the data
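A hypothetical end-to-end view of those three resolution steps; all identifiers and table contents below are made up for illustration:

url_to_sid = {"http://images.org/tux.jpg": "sid-3f9a"}   # 1. URL -> SID
sid_to_eid = {"sid-3f9a": ("http", "eid-77b2")}          # 2. SID -> (transport, EID)
eid_to_ip  = {"eid-77b2": "192.0.2.17"}                  # 3. EID -> IP

def fetch(url: str) -> str:
    sid = url_to_sid[url]             # persistent name of the data
    transport, eid = sid_to_eid[sid]  # how, and from which host, to get it
    ip = eid_to_ip[eid]               # where that host currently is
    return f"GET {sid} over {transport} from {ip}"

print(fetch("http://images.org/tux.jpg"))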
45
Layered Naming Architecture
- Protocols bind to names and resolve them
  - Applications use SIDs as handles
- SIDs and EIDs should be flat
  - Stable-name principle: a stable name should not impose restrictions on the entity it names
- Inspiration: HIP + i3 + Semantic-Free Referencing
- Prototype: Delegation Oriented Architecture (DOA)
46
[Figure: resolution and binding across layers. User-level descriptors (e.g. a search query) are resolved to SIDs; the application session layer uses SIDs as handles and resolves them to EIDs; the transport layer binds to EIDs and resolves them to IP addresses.]
47
DOA cont.
- The DOA header is located between the IP and TCP headers and carries source and destination EIDs
- The mapping service maps an EID to an IP address or to one or more EIDs, which act as intermediaries (see the sketch below)
- This allows various middleboxes to be introduced into the routing of packets
  - Service chain, end-point chain
  - Outsourcing of intermediaries
- Ideally, clients may select the most useful network elements to use
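A toy sketch of that mapping: an EID resolves either to an IP address or to further EIDs that act as intermediaries. The mapping contents and names are invented for illustration:

eid_map = {
    "eid-host":     {"eids": ["eid-firewall"]},   # delegate to a middlebox first
    "eid-firewall": {"ip": "198.51.100.9"},
}

def resolve_eid(eid: str) -> list:
    """Follow the delegation chain until IP addresses are reached."""
    entry = eid_map[eid]
    if "ip" in entry:
        return [entry["ip"]]
    return [ip for e in entry["eids"] for ip in resolve_eid(e)]

print(resolve_eid("eid-host"))  # -> ['198.51.100.9']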
48
OpenDHT
- A publicly accessible distributed hash table (DHT) service
  - Discontinued in 2009
- OpenDHT ran on a collection of 200-300 nodes on PlanetLab
- Clients do not need to participate as DHT nodes
- Used the Bamboo DHT
- www.opendht.org
- OpenDHT: A Public DHT Service and Its Uses. Sean Rhea, Brighten Godfrey, Brad Karp, John Kubiatowicz, Sylvia Ratnasamy, Scott Shenker, Ion Stoica, and Harlan Yu. Proceedings of ACM SIGCOMM 2005, August 2005.
49
Summary
- Mobility and multi-homing require directories
  - Scalability, low-latency updates
- Overlay networks have been proposed
  - Searching, storing, routing, notification, ...
  - Lookup (Chord, Tapestry, Pastry), coordination primitives (i3), middlebox support (DOA)
  - Logarithmic scalability, decentralised, ...
- Many applications for overlays
  - Lookup, rendezvous, data distribution and dissemination, coordination, service composition, general indirection support
50
Discussion Points
- Why do the most popular P2P programs have simple algorithms?
- How secure should P2P / overlays be?
- How much will DHTs be deployed in commercial software?
- Will DHTs be part of the future core Internet?
- Will there be more name layers (outside single apps), or are such fundamental changes unrealistic?
  - the joke is that DOA = Dead on Arrival
51
T-110.5140 Network Application Frameworks and XML
Additional material: Chord Joins and Leaves
52
Invariants
- Two invariants:
  - Each node's successor is correctly maintained.
  - For every key k, node successor(k) is responsible for k.
- A node stores the keys between its predecessor and itself
  - The (key, value) is stored on the successor node of the key
53
Join
- A new node n joins
  - It needs to know an existing node n'
- Three steps
  1. Initialize the predecessor and fingers of node n
  2. Update the fingers and predecessors of existing nodes to reflect the addition of n
  3. Notify the higher layer software and transfer keys
- Leave uses steps 2 (update for the removal) and 3 (relocate keys)
54
1. Initialize routing information
- Initialize the predecessor and fingers of the new node n
  - n asks n' to look up its predecessor and fingers
  - One predecessor and m fingers
- Look up the predecessor
  - Requires O(log N) time, one lookup
- Look up each finger (at most m fingers)
  - Each lookup is O(log N), so in total we have log N * log N
  - O(log^2 N) time
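Continuing the earlier Chord sketch (reusing M and lookup), a rough picture of step 1: the m per-finger lookups, each costing O(log N), are where the O(log^2 N) total comes from. The helper name is illustrative and the predecessor lookup is omitted:

def join_init(n: int, n_prime: int):
    """New node n builds its finger targets via lookups through existing node n'."""
    fingers = [lookup(n_prime, (n + 2 ** (i - 1)) % 2 ** M)   # one O(log N) lookup each
               for i in range(1, M + 1)]
    return fingers

print(join_init(26, 8))  # the fingers a joining node N26 would obtain via N8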
55
Steps 2 and 3
- 2. Updating the fingers of existing nodes
  - Existing nodes must be updated to reflect the new node (and any keys that are transferred to the new node)
  - Performed counter-clockwise on the circle
  - The algorithm takes the i:th finger of n and walks in the counter-clockwise direction until it encounters a node whose i:th finger precedes n
  - Node n will become the i:th finger of this node
  - O(log^2 N) time
- 3. Transfer keys
  - Keys are transferred only from the node immediately following n
56
References
� http://www.pdos.lcs.mit.edu/chord/