CS294, Yelick DataStructs, p1
CS 294-8
Distributed Data Structures
http://www.cs.berkeley.edu/~yelick/294

Agenda
• Overview
• Interface Issues
• Implementation Techniques
• Fault Tolerance
• Performance

Overview
• Distributed data structures are an obvious abstraction for distributed systems. Right?
• What do you want to hide within one?
  – Data layout?
  – When communication is required?
  – # and location of replicas
  – Load balancing

Distributed Data Structures
• Most of these are containers
• Two fundamentally different kinds:
  – Those with iterators or the ability to look at all container elements
    • Arrays, meshes, databases*, graphs* and trees* (sometimes)
  – Those with only single-element ops
    • Queue, directory (hash table or tree), all *’d items above

DDS in Ninja
• Described in Gribble, Brewer, Hellerstein, Culler
• A distributed data structure (DDS) is a self-managing layer for persistent data.
  – High availability, concurrency, consistency, durability, fault tolerance, scalability
• A distributed hash table is an example
  – Uses two-phase commits for consistency
  – Partitioning for scalability

Scheduling Structures
• In serial code, most scheduling is done with a stack (often implicit), a FIFO queue, or a priority queue
• Do all of these make sense in a distributed setting?
• Are there others?

Distributed Queues
• Load balancing (work stealing…)
  – Push new work onto a stack
  – Execute locally by popping from the stack
  – Steal remotely by removing from the bottom of the stack (FIFO)
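
The push / pop / steal discipline above can be sketched as a small Python class. This is a minimal single-process illustration, not the scheme of any particular system: the class name `WorkStealingDeque` and the single coarse lock are assumptions made for brevity (real work-stealing deques usually avoid locking on the owner's path).

```python
from collections import deque
import threading


class WorkStealingDeque:
    """Per-worker task pool (hypothetical name): the owner uses it as a
    stack, thieves take from the opposite end."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()  # coarse lock keeps the sketch simple

    def push(self, task):
        # Owner adds new work at the top of the stack.
        with self._lock:
            self._items.append(task)

    def pop(self):
        # Owner executes the newest task first (LIFO).
        with self._lock:
            return self._items.pop() if self._items else None

    def steal(self):
        # A remote worker removes the oldest task from the bottom (FIFO).
        with self._lock:
            return self._items.popleft() if self._items else None


local = WorkStealingDeque()
for task in ["t1", "t2", "t3"]:
    local.push(task)

newest = local.pop()    # owner gets the most recently pushed task
oldest = local.steal()  # thief gets the task pushed first
```

Taking from opposite ends means the owner and a thief rarely contend for the same task, and the thief tends to get the oldest piece of work.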

Interfaces (1)
• Blocking atomic interfaces: operations happen between invocation and return
  – Internally each operation performs locking or other form of synchronization
• Non-blocking “atomic” interfaces: operation happens sometime after invocation
  – Often paired with completion synchronization
    • Request/response for each operation
    • Wait for all “my” operations to complete
    • Wait for all operations in the world to complete
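
A non-blocking interface with two of the completion styles listed above (per-operation response, and "wait for all my operations") might look like the following sketch. It uses Python's standard `concurrent.futures`; the class `AsyncTable` and its method names are hypothetical, and a local dictionary stands in for the remote structure.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


class AsyncTable:
    """Hypothetical client handle: puts return immediately with a future."""

    def __init__(self, workers=4):
        self._data = {}
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._pending = []  # operations issued through this handle ("my" ops)

    def _do_put(self, key, value):
        # Each operation is atomic: it runs entirely under the lock.
        with self._lock:
            self._data[key] = value
        return key

    def put_async(self, key, value):
        # Non-blocking: the put takes effect sometime after this returns.
        future = self._pool.submit(self._do_put, key, value)
        self._pending.append(future)
        return future

    def wait_all_mine(self):
        # Completion synchronization over everything this handle issued.
        wait(self._pending)
        self._pending.clear()

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def close(self):
        self._pool.shutdown(wait=True)


table = AsyncTable()
ack = table.put_async("a", 1)
ack.result()               # request/response: wait for this one operation
table.put_async("b", 2)
table.put_async("c", 3)
table.wait_all_mine()      # wait for all of "my" outstanding operations
table.close()
```

The third style, waiting for all operations in the world, needs a global barrier across clients and is not shown here.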

Interfaces (2)
• Non-atomic interface: use external synchronization
  – Undefined under certain kinds of (or all) concurrency
  – May be paired with bracketing synchronization
    • Acquire-insert-lock, insert, insert, Release-insert-lock
    • Begin-transaction…
• Operations with no semantics (no-ops)
  – Prefetch, Flush copies, …
• Operations that allow for failures
  – Signal “failed”

DDS Interfaces
• Contrast:
  – RDBMSs provide ACID semantics on transactions
  – Distributed file systems: NFS weak, Frangipani and AFS stronger
• DDS:
  – All operations on elements are atomic (indivisible, all or nothing)
    • This seems to mean that the hash table operations that involve a single element are atomic
  – One-copy equivalence: replication of elements is invisible
  – No transaction across elements or operations

Implementation Strategies (1)
• Two simple techniques
  – Partitioning:
    • Used when the data structure is large
    • Used when writes/updates are frequent
  – Replication:
    • Used when writes are infrequent and reads are very frequent
    • Used to tolerate failures
    • Full static replication is extreme; dynamic partial replication is more common
• Many hybrids and variations

Implementation Strategies (2)
• Moving data to computation good for:
  – dynamic load balancing
    • I.e., idle processors grab work
  – smaller objects in ops involving > 1 object
• Moving computation to data good for:
  – large data structures
• Other?

DDS: Distributed Hash Table
• Operations include:
  – Create, Destroy
  – Put, Get, and Remove
• Built with storage “bricks”
  – Each manages a single-node, network-visible hash table
  – Each contains a buffer cache, lock manager, network stubs and skeletons
• Data is partitioned, and partitions are replicated
  – Replica groups are used for each partition

DDS: Distributed Hash Table
• Operations on elements:
  – Get – use any replica in appropriate group
  – Put or remove – update all replicas in group using two-phase commit
    • DDS library is commit coordinator
    • If an individual node crashes during the commit phase, it is removed from the replica group
    • If the DDS library fails during the commit phase, individual nodes will coordinate: if any have committed, all must
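
The get and put rules on this slide can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not the DDS implementation: `Replica`, `put`, `get`, and the `crash_in` switch are hypothetical names, crashes are simulated by a flag, and recovery from a failed coordinator is not modelled.

```python
import random


class Replica:
    """Toy storage brick (hypothetical); crash_in simulates a failure point."""

    def __init__(self, name, crash_in=None):
        self.name = name
        self.store = {}           # committed values
        self.staged = {}          # values prepared but not yet committed
        self.crash_in = crash_in  # None, "prepare", or "commit"

    def prepare(self, key, value):
        if self.crash_in == "prepare":
            return False
        self.staged[key] = value
        return True

    def commit(self, key):
        if self.crash_in == "commit":
            return False
        self.store[key] = self.staged.pop(key)
        return True

    def abort(self, key):
        self.staged.pop(key, None)


def put(group, key, value):
    """Coordinator side of a two-phase commit; True if the put committed."""
    # Phase 1: every replica in the group must agree to the update.
    prepared = []
    for replica in group:
        if replica.prepare(key, value):
            prepared.append(replica)
    if len(prepared) != len(group):
        for replica in prepared:
            replica.abort(key)
        return False
    # Phase 2: commit everywhere; a replica that crashes now is dropped
    # from the replica group rather than blocking the operation.
    for replica in list(group):
        if not replica.commit(key):
            group.remove(replica)
    return True


def get(group, key):
    # Any replica in the group can serve a read.
    return random.choice(group).store.get(key)


group = [Replica("dds1"), Replica("dds2", crash_in="commit")]
committed = put(group, "k", "v")
value = get(group, "k")
```

Dropping the crashed replica keeps the surviving members of the group consistent with each other, which is what lets a get use any of them.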
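
DDS: Hash Table
[Figure: key 110011 is looked up through the DP map to find a partition, then through the RG map to find that partition's replica group. The DP map diagram is not recoverable.]

RG map:
RG name   RG members
000       dds1, dds2
100       dds2
10        dds5, dds4
01        dds7
011       dds5, dds3
111       dds2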
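
One way to read the figure is that the key's bits select a partition name, and the RG map then gives the nodes holding that partition. The sketch below assumes the key is matched starting from its least significant bit; the slide text does not state this, but it is the reading under which no RG name on the slide is a suffix of another, so every key matches exactly one entry. The function name `find_partition` is hypothetical.

```python
# RG map copied from the slide: partition name -> replica group members.
RG_MAP = {
    "000": ["dds1", "dds2"],
    "100": ["dds2"],
    "10": ["dds5", "dds4"],
    "01": ["dds7"],
    "011": ["dds5", "dds3"],
    "111": ["dds2"],
}


def find_partition(key_bits, rg_map=RG_MAP):
    """Return the partition name that is a suffix of key_bits.

    Assumption: bits are consumed from the least significant end, so
    suffixes of increasing length are tried until one names a partition.
    """
    for length in range(1, len(key_bits) + 1):
        suffix = key_bits[-length:]
        if suffix in rg_map:
            return suffix
    raise KeyError("no partition covers key " + key_bits)


partition = find_partition("110011")  # the key shown on the slide
replicas = RG_MAP[partition]
```

Under this assumption the slide's key 110011 falls in partition 011, whose replica group is dds5 and dds3.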

Example: Aleph Directory
• Maps names to mobile objects
  – Files, locks (?), processes,…
• Interested in performance at scale, not reliability
• Two basic protocols:
  – Home: each object has a fixed “home” PE that keeps track of cache copies
  – Arrow: based on path-reversal idea

Path Reversal
[Figure, two slides: a Find request under path reversal. Diagrams not preserved.]
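
Since the diagrams are missing, here is a small sequential simulation of the path-reversal idea as used by an arrow-style directory. It is a sketch under assumptions, not Aleph's code: the `arrows` dictionary and the function `find` are hypothetical, message delivery is simulated by a loop, and concurrent requests are not modelled. Each node holds one arrow pointing toward the object; a node whose arrow points to itself is the current end of the path.

```python
def find(arrows, requester):
    """Send a find from `requester` along the arrows.

    Each node that handles the request flips its arrow to point back at
    the node the request came from, so later requests are routed toward
    the new requester. Returns (node previously at the end, messages sent).
    """
    node = arrows[requester]
    arrows[requester] = requester  # requester becomes the new end of the path
    if node == requester:
        return requester, 0        # it already held the object
    sender, hops = requester, 1
    while arrows[node] != node:
        next_node = arrows[node]
        arrows[node] = sender      # reverse this edge of the path
        sender, node = node, next_node
        hops += 1
    arrows[node] = sender          # the old end now points toward the requester
    return node, hops


# Four nodes in a line; the object starts at node 3.
arrows = {0: 1, 1: 2, 2: 3, 3: 3}
owner, hops = find(arrows, 0)       # node 0 requests the object
after_first = dict(arrows)
owner2, hops2 = find(arrows, 3)     # node 3 asks for it back
```

After the first find, every arrow on the path has been reversed so that it leads to node 0; the second find follows the reversed path.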

Aleph Directory Performance
• Aleph is implemented as Java packages on top of RMI (and UDP?)
• Run on small systems (up to 16 nodes)
  – Assumed that “home” centralized solution would be faster at this scale
    • 2 messages to request; 2 to retrieve
  – Arrow was actually faster
    • log2 p to request; 1 to retrieve
    • In practice, only 2 to request (counter ex.)

Hybrid Directory Protocol
• Essentially the same as the “home” protocol, except
• Link waiting processors into a chain (across the processors)
  – Each keeps the id of the processor ahead of it in the chain
• Under high contention, resource moves down the chain
• Performance:
  – Faster than home and arrow on counter benchmark and some others…

How Many Data Structures?
• Gribble et al. claim:
  – “We believe that given a small set of DDS types (such as a hash table, a tree, and an administrative log), authors will be able to build a large class of interesting and sophisticated servers.”
  – Do you believe this?
  – What does it imply about tools vs. libraries?

Administrivia
• Gautam Kar and Joe L. Hellerstein speaking Thursday
  – Papers online
  – Contact me about meeting with them
• Final projects:
  – Send mail to schedule meeting with me
• Next week:
  – Tuesday: guest lecture by Aaron Brown on benchmarks; related to Kar and Hellerstein work.
  – Still to come: Gray, Lamport, and Liskov