
CS294, Yelick DataStructs, p1

CS 294-8
Distributed Data Structures
http://www.cs.berkeley.edu/~yelick/294


Agenda
• Overview
• Interface Issues
• Implementation Techniques
• Fault Tolerance
• Performance


Overview
• Distributed data structures are an obvious abstraction for distributed systems. Right?
• What do you want to hide within one?
  – Data layout?
  – When communication is required?
  – Number and location of replicas
  – Load balancing


Distributed Data Structures
• Most of these are containers
• Two fundamentally different kinds:
  – Those with iterators or the ability to look at all container elements
    • Arrays, meshes, databases*, graphs*, and trees* (sometimes)
  – Those with only single-element ops
    • Queue, directory (hash table or tree), all *’d items above


DDS in Ninja
• Described in Gribble, Brewer, Hellerstein, Culler
• A distributed data structure (DDS) is a self-managing layer for persistent data.
  – High availability, concurrency, consistency, durability, fault tolerance, scalability
• A distributed hash table is an example
  – Uses two-phase commits for consistency
  – Partitioning for scalability


Scheduling Structures
• In serial code, most scheduling is done with a stack (often implicit), a FIFO queue, or a priority queue
• Do all of these make sense in a distributed setting?
• Are there others?


Distributed Queues
• Load balancing (work stealing…)
  – Push new work onto a stack
  – Execute locally by popping from the stack
  – Steal remotely by removing from the bottom of the stack (FIFO)
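The push/pop/steal discipline above can be sketched as a double-ended queue. This is a minimal single-process illustration, not a distributed implementation: the class name `WorkStealingDeque` and its method names are hypothetical, and a plain lock stands in for whatever synchronization a real scheduler would use.

```python
from collections import deque
import threading


class WorkStealingDeque:
    """One worker's task pool: the owner treats it as a stack, thieves take the oldest task."""

    def __init__(self):
        self._tasks = deque()
        self._lock = threading.Lock()  # assumption: a simple lock, for clarity only

    def push(self, task):
        # Owner pushes new work onto the top of the stack.
        with self._lock:
            self._tasks.append(task)

    def pop(self):
        # Owner executes locally by popping the newest task (LIFO).
        with self._lock:
            return self._tasks.pop() if self._tasks else None

    def steal(self):
        # A remote worker removes the oldest task from the bottom (FIFO).
        with self._lock:
            return self._tasks.popleft() if self._tasks else None


local = WorkStealingDeque()
for task in ["t1", "t2", "t3"]:
    local.push(task)

stolen = local.steal()  # the thief gets the oldest task
popped = local.pop()    # the owner gets the newest task
```

Taking from opposite ends keeps the owner and the thief away from each other's tasks most of the time, which is one common argument for this layout.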


Interfaces (1)
• Blocking atomic interfaces: operations happen between invocation and return
  – Internally each operation performs locking or another form of synchronization
• Non-blocking “atomic” interfaces: operation happens sometime after invocation
  – Often paired with completion synchronization
    • Request/response for each operation
    • Wait for all “my” operations to complete
    • Wait for all operations in the world to complete
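As a rough illustration of a non-blocking interface paired with completion synchronization, the sketch below uses Python's standard `concurrent.futures`. The `AsyncTable` class is hypothetical and runs in one process; it only shows the shape of the interface, with a single worker thread standing in for the remote side.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class AsyncTable:
    """Hypothetical non-blocking wrapper: put() returns at once, the update lands later."""

    def __init__(self):
        self._data = {}
        # A single worker thread applies operations one at a time, in submission order.
        self._executor = ThreadPoolExecutor(max_workers=1)

    def put(self, key, value):
        # Returns a Future immediately; the caller decides when to synchronize.
        return self._executor.submit(self._data.__setitem__, key, value)

    def get(self, key):
        return self._executor.submit(self._data.get, key)

    def close(self):
        # Blocks until every operation submitted so far has completed.
        self._executor.shutdown(wait=True)


table = AsyncTable()
mine = [table.put(k, k * k) for k in range(4)]  # issue several operations

wait(mine)                      # wait for all "my" operations to complete
answer = table.get(3).result()  # request/response for a single operation
table.close()
```

Here `close()` is the nearest analogue of waiting for all operations in the world, since it drains everything submitted by any caller of this table.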


Interfaces (2)
• Non-atomic interface: use external synchronization
  – Undefined under certain kinds of (or all) concurrency
  – May be paired with bracketing synchronization
    • Acquire-insert-lock, insert, insert, Release-insert-lock
    • Begin-transaction…
• Operations with no semantics (no-ops)
  – Prefetch, Flush copies, …
• Operations that allow for failures
  – Signal “failed”


DDS Interfaces
• Contrast:
  – RDBMSs provide ACID semantics on transactions
  – Distributed file systems: NFS weak, Frangipani and AFS stronger
• DDS:
  – All operations on elements are atomic (indivisible, all or nothing)
    • This seems to mean that the hash table operations that involve a single element are atomic
  – One-copy equivalence: replication of elements is invisible
  – No transactions across elements or operations


Implementation Strategies (1)
• Two simple techniques
  – Partitioning:
    • Used when the data structure is large
    • Used when writes/updates are frequent
  – Replication:
    • Used when writes are infrequent and reads are very frequent
    • Used to tolerate failures
    • Full static replication is extreme; dynamic partial replication is more common
• Many hybrids and variations


Implementation Strategies (2)
• Moving data to computation good for:
  – dynamic load balancing
    • I.e., idle processors grab work
  – smaller objects in ops involving > 1 object
• Moving computation to data good for:
  – large data structures
• Other?


DDS: Distributed Hash Table
• Operations include:
  – Create, Destroy
  – Put, Get, and Remove
• Built with storage “bricks”
  – Each manages a single-node, network-visible hash table
  – Each contains a buffer cache, lock manager, network stubs and skeletons
• Data is partitioned, and partitions are replicated
  – Replica groups are used for each partition


DDS: Distributed Hash Table
• Operations on elements:
  – Get – use any replica in appropriate group
  – Put or remove – update all replicas in group using two-phase commit
    • DDS library is commit coordinator
    • If an individual node crashes during the commit phase, it is removed from the replica group
    • If the DDS library fails during the commit phase, individual nodes will coordinate: if any have committed, all must
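The get and put paths above can be sketched in a few lines. This is a single-process illustration under stated assumptions, not the Ninja implementation: `Brick`, `put`, and `get` are hypothetical names, a crash is simulated with a raised `ConnectionError`, and aborting everywhere on a failed prepare is the textbook two-phase-commit rule rather than something the slide states. The case where the coordinator itself fails is not modeled.

```python
class Brick:
    """Hypothetical stand-in for one storage brick holding a replica."""

    def __init__(self, name):
        self.name = name
        self.store = {}    # committed key -> value
        self.pending = {}  # prepared but not yet committed updates
        self.alive = True

    def prepare(self, key, value):
        # Phase 1: stage the update and vote; a crashed brick cannot vote yes.
        if not self.alive:
            return False
        self.pending[key] = value
        return True

    def commit(self, key):
        # Phase 2: make the staged update visible.
        if not self.alive:
            raise ConnectionError(self.name)
        self.store[key] = self.pending.pop(key)

    def abort(self, key):
        self.pending.pop(key, None)


def put(group, key, value):
    """Coordinator (the DDS library's role): update every replica or none."""
    votes = [brick.prepare(key, value) for brick in group]
    if not all(votes):
        # Assumption: standard 2PC behavior, abort everywhere on any "no" vote.
        for brick in group:
            brick.abort(key)
        return False
    for brick in list(group):
        try:
            brick.commit(key)
        except ConnectionError:
            # A brick that crashes during the commit phase leaves the group.
            group.remove(brick)
    return True


def get(group, key):
    # Any replica in the group can serve a read.
    return group[0].store.get(key)


group = [Brick("dds1"), Brick("dds2")]
ok = put(group, "k", "v")
value = get(group, "k")
```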


DDS: Hash Table
[Figure: lookup of Key: 110011 through the DP map (a trie with edges labeled 0 and 1) and the RG map]

RG map:
RG name    RG members
000        dds1, dds2
100        dds2
10         dds5, dds4
01         dds7
011        dds5, dds3
111        dds2
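One way to read the DP map and RG map together is as a two-step lookup: key bits select a replica group name, and the name selects the group's members. The sketch below assumes the trie consumes key bits starting from the least significant (rightmost) bit, so that group names match suffixes of the key. That bit order is an assumption, chosen because no RG name on the slide is a suffix of another; the function name `find_replica_group` is hypothetical.

```python
# RG map from the slide: replica group name -> member bricks.
RG_MAP = {
    "000": ["dds1", "dds2"],
    "100": ["dds2"],
    "10": ["dds5", "dds4"],
    "01": ["dds7"],
    "011": ["dds5", "dds3"],
    "111": ["dds2"],
}


def find_replica_group(key_bits, rg_map=RG_MAP):
    """Read the key from its last bit backwards until the bits seen so far name a group.

    Assumption: the DP map trie branches on key bits from the least significant
    end, so each replica group name is a suffix of the keys it covers.
    """
    for length in range(1, len(key_bits) + 1):
        name = key_bits[-length:]
        if name in rg_map:
            return name, rg_map[name]
    raise KeyError(key_bits)


name, members = find_replica_group("110011")
```

Under this reading the slide's key 110011 lands in group 011. With the opposite bit order the same key would match no group name at all, which is weak evidence for the suffix interpretation.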


Example: Aleph Directory
• Maps names to mobile objects
  – Files, locks (?), processes, …
• Interested in performance at scale, not reliability
• Two basic protocols:
  – Home: each object has a fixed “home” PE that keeps track of cache copies
  – Arrow: based on path-reversal idea


Path Reversal
[Figure: Find]
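A sequential sketch of path reversal, as it is commonly described for arrow-style directories: every node keeps one pointer toward the object's current owner, and a find request flips each pointer it crosses so that it points back toward the requester. This is an illustration under assumptions. The function `find` and the `links` dictionary are hypothetical, and concurrency and message passing are ignored.

```python
def find(links, requester):
    """Follow arrows from the requester to the owner, reversing each one on the way.

    links maps each node to the node its arrow points at; the owner points at itself.
    Returns the previous owner, behind which the request is queued.
    """
    # The requester becomes the new end of every path.
    prev, node = requester, links[requester]
    links[requester] = requester
    while True:
        nxt = links[node]
        links[node] = prev  # reverse the arrow toward where the request came from
        if nxt == node:     # a self-pointing node was the old owner
            return node
        prev, node = node, nxt


# Arrows: a -> b -> c, d -> c, and c owns the object.
links = {"a": "b", "b": "c", "c": "c", "d": "c"}
first = find(links, "a")   # a's request travels a -> b -> c
second = find(links, "d")  # d's request now follows the reversed arrows to a
```

After both calls the arrows lead from every node to d, the most recent requester, so the next find reaches it without going through a fixed home node.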


Aleph Directory Performance
• Aleph is implemented as Java packages on top of RMI (and UDP?)
• Run on small systems (up to 16 nodes)
  – Assumed that “home” centralized solution would be faster at this scale
    • 2 messages to request; 2 to retrieve
  – Arrow was actually faster
    • log2 p to request; 1 to retrieve
    • In practice, only 2 to request (counter ex.)


Hybrid Directory Protocol
• Essentially the same as the “home” protocol, except
• Link waiting processors into a chain (across the processors)
  – Each keeps the id of the processor ahead of it in the chain
• Under high contention, resource moves down the chain
• Performance:
  – Faster than home and arrow on counter benchmark and some others…


How Many Data Structures?
• Gribble et al. claim:
  – “We believe that given a small set of DDS types (such as a hash table, a tree, and an administrative log), authors will be able to build a large class of interesting and sophisticated servers.”
  – Do you believe this?
  – What does it imply about tools vs. libraries?


Administrivia
• Gautam Kar and Joe L. Hellerstein speaking Thursday
  – Papers online
  – Contact me about meeting with them
• Final projects:
  – Send mail to schedule meeting with me
• Next week:
  – Tuesday: guest lecture by Aaron Brown on benchmarks; related to Kar and Hellerstein work.
  – Still to come: Gray, Lamport, and Liskov