7. Key-Value Databases: In Depth
Uploaded by: fabio-fumarola
Category: Data & Analytics
![Page 1: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/1.jpg)
Key-Value Databases: In Depth
Dr. Fabio Fumarola
![Page 2: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/2.jpg)
Outline
• Key-values introduction
• Major Key-Value Databases
• Dynamo: how it is implemented
  – Background
  – Partitioning: Consistent Hashing
  – High Availability for writes: Vector Clocks
  – Handling temporary failures: Sloppy Quorum
  – Recovering from failures: Merkle Trees
  – Membership and failure detection: Gossip Protocol
![Page 3: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/3.jpg)
Key-Value Databases
• A key-value store is a simple hash table in which all access to the database is via primary keys.
• A client can either:
  – Get the value for a key
  – Put a value for a key
  – Delete a key from the data store.
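The three operations above can be sketched as a tiny in-memory store. This is only an illustration: the class name `KeyValueStore` and its method signatures are assumptions made for this example, not the API of any of the products discussed in this deck.

```scala
import scala.collection.mutable

// Hypothetical in-memory store; names are illustrative only.
final class KeyValueStore[K, V] {
  private val table = mutable.HashMap.empty[K, V]

  /** Get the value for a key, if present. */
  def get(key: K): Option[V] = table.get(key)

  /** Put (insert or overwrite) a value for a key. */
  def put(key: K, value: V): Unit = table.update(key, value)

  /** Delete a key; returns true if the key existed. */
  def delete(key: K): Boolean = table.remove(key).isDefined
}

object KeyValueStoreDemo {
  def main(args: Array[String]): Unit = {
    val store = new KeyValueStore[String, String]
    store.put("cart:42", "pizza")
    println(store.get("cart:42"))    // Some(pizza)
    println(store.delete("cart:42")) // true
    println(store.get("cart:42"))    // None
  }
}
```

Every access goes through the key, which is why there is no way to filter or join on the values.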
![Page 4: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/4.jpg)
Key-value store: characteristics
• Key-value data access enables high performance and availability.
• Both keys and values can be complex compound objects, and sometimes lists, maps or other data structures.
• Consistency applies only to operations on a single key (eventual consistency).
![Page 5: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/5.jpg)
Key-Values: Cons
• No complex query filters
• All joins must be done in code
• No foreign key constraints
• No triggers
![Page 6: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/6.jpg)
Key-Values: Pros
• Efficient queries (very predictable performance).
• Easy to distribute across a cluster.
• Service orientation disallows foreign key constraints and forces joins to be done in code anyway.
• Using a relational DB plus a cache forces you into key-value storage anyway.
• No object-relational mismatch.
![Page 7: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/7.jpg)
Popular Key-Value Stores
• Riak – Basho
• Redis – data structure server
• Memcached
• Berkeley DB – Oracle
• Aerospike – fast key-value store for SSDs
• LevelDB – Google key-value store
• DynamoDB – Amazon key-value store
• Voldemort – open-source implementation of Amazon's Dynamo design
![Page 8: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/8.jpg)
Memcached
• Atomic set/get/delete operations.
• O(1) set/get/delete.
• Consistent hashing.
• In-memory caching, no persistence.
• LRU eviction policy.
• No iterators.
![Page 9: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/9.jpg)
Aerospike
• Key-value database optimized for a hybrid (DRAM + flash) approach.
• First published in the Proceedings of VLDB (Very Large Data Bases) in 2011: “Citrusleaf: A Real-Time NoSQL DB which Preserves ACID”.
![Page 10: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/10.jpg)
Redis
• Written in C and released under the BSD license.
• It is an advanced key-value store.
• Keys can hold strings, hashes, lists, sets, sorted sets, bitmaps and HyperLogLogs.
• It works with an in-memory dataset.
• Data can be persisted either by dumping the dataset to disk every once in a while, or by appending each command to a log.
• Created by Salvatore Sanfilippo (Pivotal)
![Page 11: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/11.jpg)
Riak
• Distributed database written in Erlang and C, with some JavaScript
• Operations
  – GET /buckets/BUCKET/keys/KEY
  – PUT|POST /buckets/BUCKET/keys/KEY
  – DELETE /buckets/BUCKET/keys/KEY
• Integrated with Solr and MapReduce
• Data types: basic, Sets and Maps

    curl -XPUT 'http://localhost:8098/riak/food/favorite' \
      -H 'Content-Type:text/plain' \
      -d 'pizza'
![Page 12: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/12.jpg)
LevelDB
LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
• Keys and values are arbitrary byte arrays.
• Data is stored sorted by key.
• The basic operations are Put(key, value), Get(key), Delete(key).
• Multiple changes can be made in one atomic batch.
Limitation
• There is no client-server support built into the library.
![Page 13: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/13.jpg)
DynamoDB
• Peer-to-peer key-value database.
• Service Level Agreements expressed at the 99.9th percentile.
• Highly available, sacrificing consistency.
• Can handle online node additions and node failures.
• It supports object versioning and application-assisted conflict resolution (eventually consistent data structures).
![Page 14: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/14.jpg)
Dynamo
Amazon’s Highly Available Key-value Store
![Page 15: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/15.jpg)
Amazon Dynamo
• We analyze the design and implementation of Dynamo.
• Amazon runs a worldwide e-commerce platform.
• It serves tens of millions of customers at peak times,
• using tens of thousands of servers located in many data centers around the world.
• They have performance, reliability and efficiency requirements that demand a fully scalable platform.
![Page 16: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/16.jpg)
Motivation of Dynamo
• Many Amazon services only need primary-key access to a data store:
  – Best-seller lists
  – Shopping carts
  – Customer preferences
  – Session management
  – Sales rank and product catalogs
• Using a relational database would lead to inefficiencies and limit scale and availability.
![Page 17: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/17.jpg)
Background
![Page 18: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/18.jpg)
Scalability is application dependent
• Lesson 1: the reliability and scalability of a system depend on how its application state is managed.
• Amazon uses a highly decentralized, loosely coupled, service-oriented architecture composed of hundreds of services.
• They need storage that is always available.
![Page 19: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/19.jpg)
Shopping carts, always
• Customers should be able to view and add items to their shopping carts even if:
  – Disks are failing, or
  – A data center is being destroyed by a tornado or a kraken.
![Page 20: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/20.jpg)
Failures Happen
• When you deal with an infrastructure composed of millions of components, servers and network components crash.
http://letitcrash.com/
![Page 21: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/21.jpg)
High Availability by contract
• A Service Level Agreement (SLA) is the guarantee that an application can deliver its functionality within a bounded time.
• An example SLA is to guarantee that the Acme API provides a response within 300 ms for 99.9% of its requests at a peak of 500 concurrent users (CCU).
• Normally an SLA is described using the average, median and expected variance.
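To make the difference between an average and a percentile bound concrete, here is a minimal sketch that checks a latency sample against a "300 ms at the 99.9th percentile" target. The nearest-rank method, the `perMille` parameter (999 means 99.9%) and the sample data are assumptions chosen for this illustration.

```scala
object SlaCheck {
  /**
   * Nearest-rank percentile. `perMille` is the percentile in tenths of a
   * percent (999 = 99.9%), so the rank is computed with integer arithmetic.
   */
  def percentile(samplesMs: Seq[Long], perMille: Int): Long = {
    require(samplesMs.nonEmpty && perMille >= 1 && perMille <= 1000)
    val sorted = samplesMs.sorted
    // 1-based rank = ceil(perMille * n / 1000)
    val rank = ((perMille.toLong * sorted.size + 999) / 1000).toInt
    sorted(rank - 1)
  }

  /** True when the requested percentile is within the latency bound. */
  def meetsSla(samplesMs: Seq[Long], perMille: Int, boundMs: Long): Boolean =
    percentile(samplesMs, perMille) <= boundMs

  def main(args: Array[String]): Unit = {
    // Made-up sample: 998 fast requests and 2 very slow ones.
    val latencies = Seq.fill(998)(20L) ++ Seq(900L, 950L)
    println(latencies.sum.toDouble / latencies.size) // the mean looks healthy
    println(percentile(latencies, 999))              // 900
    println(meetsSla(latencies, 999, 300L))          // false
  }
}
```

The mean of this sample is far below 300 ms, yet the 99.9th percentile is not, which is why a percentile-based SLA is the stricter contract.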
![Page 22: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/22.jpg)
Dynamo
It uses a synthesis of well-known techniques to achieve scalability and availability.
1. Data is partitioned and replicated using consistent hashing [Karger et al. 1997].
2. Consistency is facilitated by vector clocks and object versioning [Lamport 1978].
3. Consistency among replicas is maintained by a decentralized replica synchronization protocol (E-CRDT).
4. A gossip protocol is used for membership and failure detection.
![Page 23: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/23.jpg)
System Interface
• Dynamo stores objects associated with a key through two operations: get() and put().
  – get(key) locates the object replicas associated with the key in the storage system and returns a single object, or a list of objects with conflicting versions, along with a context.
  – put(key, context, object) determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk.
  – The context encodes system metadata about the object.
![Page 24: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/24.jpg)
Key and Value encoding
• Dynamo treats both the key and the object supplied by the caller as an opaque array of bytes.
• It applies an MD5 hash to the key to generate a 128-bit identifier, which is used to determine the storage nodes responsible for serving the key.
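As a small illustration of the hashing step, the sketch below turns a key into a 128-bit identifier with the JDK's `MessageDigest`. The object and function names (`KeyHashing`, `ringPosition`) and the sample key are made up for this example; how Dynamo itself represents the identifier internally is not described in these slides.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object KeyHashing {
  /** MD5 digest of the opaque key bytes, read as an unsigned 128-bit integer. */
  def ringPosition(key: Array[Byte]): BigInt =
    BigInt(1, MessageDigest.getInstance("MD5").digest(key))

  def main(args: Array[String]): Unit = {
    val pos = ringPosition("cart:42".getBytes(StandardCharsets.UTF_8))
    println(pos.toString(16))     // identifier in hexadecimal
    println(pos.bitLength <= 128) // true: MD5 always yields 16 bytes
  }
}
```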
![Page 25: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/25.jpg)
Dynamo Architectural Choice 1/2
We focus on the core distributed systems techniques used.

| Problem | Technique | Advantage |
|---|---|---|
| Partitioning | Consistent hashing | Incremental scalability |
| High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates. |
| Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and a durability guarantee when some of the replicas are not available. |
![Page 26: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/26.jpg)
Dynamo Architectural Choice 2/2
We focus on the core distributed systems techniques used.

| Problem | Technique | Advantage |
|---|---|---|
| Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background. |
| Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information. |
![Page 27: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/27.jpg)
Partitioning: Consistent Hashing
• Dynamo must scale incrementally.
• This requires a mechanism to dynamically partition the data over the set of nodes (i.e., storage hosts) in the system.
• Dynamo’s partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts.
• The output range of a hash function is treated as a fixed circular space or ring.
![Page 28: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/28.jpg)
Partitioning: Consistent Hashing
• Each node in the system is assigned a random value within this space, which represents its “position” on the ring.
• Each data item is assigned to a node by:
  1. hashing the data item’s key to yield its position on the ring,
  2. and then walking the ring clockwise to find the first node with a position larger than the item’s position.
![Page 29: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/29.jpg)
Partitioning: Consistent Hashing
• Each node becomes responsible for the region of the ring between it and its predecessor node on the ring.
• The principal advantage of consistent hashing is that the departure or arrival of a node only affects its immediate neighbors; other nodes remain unaffected.
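The ring lookup described on the last three slides can be sketched with a sorted map. This is a simplified illustration, not Dynamo's code: node positions are derived here by hashing the node name (the slides say positions are random), and the names `Ring`, `add`, `remove` and `nodeFor` are invented for the example.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest
import scala.collection.immutable.TreeMap

object ConsistentHashRing {
  /** 128-bit position on the ring (MD5 read as an unsigned integer). */
  def hash(s: String): BigInt =
    BigInt(1, MessageDigest.getInstance("MD5").digest(s.getBytes(UTF_8)))

  /** Immutable ring: position -> node name, kept sorted by position. */
  final case class Ring(positions: TreeMap[BigInt, String] = TreeMap.empty[BigInt, String]) {
    // Assumption: a node's position is the hash of its name.
    def add(node: String): Ring = Ring(positions + (hash(node) -> node))
    def remove(node: String): Ring = Ring(positions - hash(node))

    /** First node clockwise with a position larger than the key's, wrapping around. */
    def nodeFor(key: String): Option[String] =
      if (positions.isEmpty) None
      else {
        val it = positions.iteratorFrom(hash(key) + 1)
        Some(if (it.hasNext) it.next()._2 else positions.head._2)
      }
  }

  def main(args: Array[String]): Unit = {
    val ring = Ring().add("A").add("B").add("C")
    val smaller = ring.remove("C")
    val keys = (1 to 1000).map(i => s"key-$i")
    val moved = keys.filter(k => ring.nodeFor(k) != smaller.nodeFor(k))
    // Only keys that C owned change owner when C leaves.
    println(moved.forall(k => ring.nodeFor(k) == Some("C"))) // true
    println(s"${moved.size} of ${keys.size} keys moved")
  }
}
```

With a plain `hash(key) % numberOfNodes` scheme, by contrast, removing one node would remap most keys.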
![Page 30: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/30.jpg)
Consistent Hashing: Idea
• Consistent hashing is a technique that lets you smoothly handle these problems:
  1. Given a resource key and a list of servers, how do you find a primary, secondary, tertiary (and on down the line) server for the resource?
  2. If you have servers of different sizes, how do you assign each of them an amount of work that corresponds to its capacity?
![Page 31: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/31.jpg)
Consistent Hashing: Idea
• Consistent hashing is a technique that lets you smoothly handle these problems:
  3. How do you smoothly add capacity to the system without downtime?
  4. Specifically, this means solving two problems:
     • How do you avoid dumping 1/N of the total load on a new server as soon as you turn it on?
     • How do you avoid rehashing more existing keys than necessary?
![Page 32: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/32.jpg)
Consistent Hashing: How To
• Imagine a 128-bit space.
• Visualize it as a ring, or a clock face.
• Now imagine hashing resources into points on the circle.
![Page 33: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/33.jpg)
Consistent Hashing: How To
• They could be URLs, GUIDs, integer IDs, or any arbitrary sequence of bytes.
• Just run them through a good hash function (e.g., MD5) and shave off everything but 16 bytes.
• We have four key-values: 1, 2, 3, 4.
![Page 34: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/34.jpg)
Consistent Hashing: How To
• Finally, imagine our servers:
  – A,
  – B, and
  – C
• We put our servers on the same ring.
• This solves the problem of which server should serve Resource 2.
![Page 35: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/35.jpg)
Consistent Hashing: How To
• We start where resource 2 is and head clockwise on the ring until we hit a server.
• If that server is down, we go to the next one, and so on and so forth.
![Page 36: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/36.jpg)
Consistent Hashing: How To
• Key-values 4 and 1 belong to server A
• Key-value 2 belongs to server B
• Key-value 3 belongs to server C
![Page 37: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/37.jpg)
Consistent Hashing: Del Server
• If server C is removed:
• Key-value 3 now belongs to server A
• All the other key-value mappings are unchanged
![Page 38: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/38.jpg)
Consistent Hashing: Add Server
• If server D is added at the position marked,
• which objects will belong to D?
![Page 39: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/39.jpg)
Consistent Hashing: Cons
• This works well, except that the size of the intervals assigned to each cache is pretty hit and miss.
• Since it is essentially random, it is possible to have a very non-uniform distribution of objects between caches.
• To address this issue, the idea of “virtual nodes” is introduced.
![Page 40: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/40.jpg)
Consistent Hashing: Virtual Nodes
• Instead of mapping a server to a single point in the circle, each server gets assigned to multiple points in the ring.
• A virtual node looks like a single node in the system, but each node can be responsible for more than one virtual node.
• Effectively, when a new node is added to the system, it is assigned multiple positions in the ring.
![Page 41: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/41.jpg)
Virtual Nodes: Advantages
• If a node becomes unavailable (due to failures or routine maintenance), the load handled by this node is evenly dispersed across the remaining available nodes.
• When a node becomes available again, or a new node is added to the system, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes.
• The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
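A rough way to see the effect of virtual nodes is to place each physical node on the ring several times and count how many keys each one ends up owning. The sketch below does this under assumptions of its own: the `node#i` naming of virtual nodes, the token counts and the key names are all invented for the example.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest
import scala.collection.immutable.TreeMap

object VirtualNodes {
  def hash(s: String): BigInt =
    BigInt(1, MessageDigest.getInstance("MD5").digest(s.getBytes(UTF_8)))

  /** Build a ring in which each physical node owns the given number of positions. */
  def buildRing(tokensPerNode: Map[String, Int]): TreeMap[BigInt, String] =
    tokensPerNode.foldLeft(TreeMap.empty[BigInt, String]) { case (ring, (node, tokens)) =>
      // "node#i" is a made-up naming scheme for the i-th virtual node.
      ring ++ (0 until tokens).map(i => hash(s"$node#$i") -> node)
    }

  /** Physical node owning the key; the ring must not be empty. */
  def nodeFor(ring: TreeMap[BigInt, String], key: String): String = {
    val it = ring.iteratorFrom(hash(key) + 1)
    if (it.hasNext) it.next()._2 else ring.head._2
  }

  /** Number of keys owned by each physical node. */
  def load(ring: TreeMap[BigInt, String], keys: Seq[String]): Map[String, Int] =
    keys.groupBy(k => nodeFor(ring, k)).map { case (node, ks) => node -> ks.size }

  def main(args: Array[String]): Unit = {
    val keys = (1 to 10000).map(i => s"key-$i")
    // One position per node: interval sizes are left to chance.
    println(load(buildRing(Map("A" -> 1, "B" -> 1, "C" -> 1)), keys))
    // Many positions per node, with C given twice the tokens of A and B.
    println(load(buildRing(Map("A" -> 100, "B" -> 100, "C" -> 200)), keys))
  }
}
```

With many tokens per node the shares tend to follow the token counts, so a larger machine can simply be given more virtual nodes.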
![Page 42: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/42.jpg)
Data Replication
• To achieve high availability and durability, Dynamo replicates its data on multiple hosts.
• Each data item is replicated at N hosts, where N is a parameter configured “per-instance”.
• Each key k is assigned to a coordinator node (described above).
• The coordinator is in charge of the replication of the data items that fall within its range of the ring.
![Page 43: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/43.jpg)
Data Replication
• The coordinator locally stores each key within its range,
• and, in addition, replicates these keys at the N-1 clockwise successor nodes in the ring.
![Page 44: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/44.jpg)
Data Replication
• The list of nodes responsible for storing a particular key is called the preference list.
• The system is designed so that every node in the system can determine which nodes should be in this list for any particular key.
• To account for node failures, the preference list contains more than N nodes.
• With virtual nodes, the first N successor positions for a key k may be owned by fewer than N distinct physical nodes; to avoid this, the preference list skips positions in the ring so that it contains only distinct physical nodes.
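The rule of skipping positions so that the list holds only distinct physical nodes can be sketched as follows. The toy ring uses small integer positions instead of 128-bit hashes to keep the example readable; the function name `preferenceList` and the ring contents are assumptions of this example.

```scala
import scala.collection.immutable.TreeMap

object PreferenceList {
  /**
   * Walk the ring clockwise from `keyPos` and collect the first `n` distinct
   * physical nodes, skipping positions whose owner is already in the list.
   */
  def preferenceList(ring: TreeMap[Int, String], keyPos: Int, n: Int): List[String] = {
    // Positions after the key first, then wrap around to the start of the ring.
    val clockwise = ring.iteratorFrom(keyPos + 1) ++ ring.iterator
    clockwise.map(_._2).toList.distinct.take(n)
  }

  def main(args: Array[String]): Unit = {
    // Toy ring: physical node A owns two adjacent virtual nodes (10 and 20).
    val ring = TreeMap(10 -> "A", 20 -> "A", 30 -> "B", 40 -> "C", 50 -> "D")
    println(preferenceList(ring, 5, 3))  // List(A, B, C)
    println(preferenceList(ring, 45, 3)) // List(D, A, B)
  }
}
```

The first entry plays the role of the coordinator; the remaining entries are its clockwise successors. Asking for more than N entries gives the spare nodes used when some replicas are down.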
![Page 45: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/45.jpg)
High Availability for writes
• With eventual consistency, writes are propagated asynchronously.
• A put() may return to its caller before the update has been applied at all the replicas.
• This can result in scenarios where a subsequent get() operation returns an object that does not have the latest updates.
![Page 46: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/46.jpg)
High Availability for writes: Example
• We can see this with shopping carts.
• The “Add to Cart” operation can never be forgotten or rejected.
• When a customer wants to add an item to (or remove it from) a shopping cart and the latest version is not available, the item is added to (or removed from) the older version and the divergent versions are reconciled later.
• Question!
![Page 47: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/47.jpg)
High Availability for writes
• Dynamo treats the result of each modification as a new and immutable version of the data.
• It allows multiple versions of an object to be present in the system at the same time.
• Most of the time, new versions subsume the previous version(s), and the system itself can determine the authoritative version (syntactic reconciliation).
![Page 48: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/48.jpg)
Singly-Linked List: START
![Page 49: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/49.jpg)
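Singly-Linked List

[Diagram: a list holding 3, 5 and 7, built from Cons cells and terminated by Nil]

    abstract sealed class List {
      def head: Int
      def tail: List
      def isEmpty: Boolean

      // Helper assumed by the slides (not shown there): abort with a message.
      def fail(m: String): Nothing = throw new NoSuchElementException(m)
    }

    // The empty list: it has neither a head nor a tail.
    case object Nil extends List {
      def head: Int = fail("Empty list.")
      def tail: List = fail("Empty list.")
      def isEmpty: Boolean = true
    }

    // A cell holding one element and a reference to the rest of the list.
    case class Cons(head: Int, tail: List = Nil) extends List {
      def isEmpty: Boolean = false
    }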
![Page 50: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/50.jpg)
List: analysis

    A = the list holding 3, 5 and 7
    B = Cons(9, A)           // 9 followed by A
    C = Cons(1, Cons(8, B))  // 1 and 8 followed by B

Structural sharing: B and C reuse the cells of A instead of copying them.
![Page 51: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/51.jpg)
List: append & prepend

[Diagram: 9 prepended to, and appended to, the list holding 3, 5 and 7]

    /**
     * Time - O(1)
     * Space - O(1)
     */
    def prepend(x: Int): List = Cons(x, this)

    /**
     * Time - O(n)
     * Space - O(n)
     */
    def append(x: Int): List =
      if (isEmpty) Cons(x)
      else Cons(head, tail.append(x))
![Page 52: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/52.jpg)
List: apply

[Diagram: looking up index n in a six-element list by recursing on the tail with n - 1]

    /**
     * Time - O(n)
     * Space - O(n)
     */
    def apply(n: Int): Int =
      if (isEmpty) fail("Index out of bounds.")
      else if (n == 0) head
      else tail(n - 1) // or tail.apply(n - 1)
![Page 53: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/53.jpg)
List: concat

[Diagram: A holds 2, 4 and 6; B holds 3, 5 and 7; C = A.concat(B) copies the cells of A (path copying) and shares the cells of B]

    /**
     * Time - O(n)
     * Space - O(n)
     */
    def concat(xs: List): List =
      if (isEmpty) xs
      else tail.concat(xs).prepend(head)
![Page 54: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/54.jpg)
List: reverse (two approaches)

[Diagram: reversing a three-element list]

The straightforward solution in O(n²):

    def reverse: List =
      if (isEmpty) Nil
      else tail.reverse.append(head)

or tail recursion in O(n):

    import scala.annotation.tailrec

    def reverse: List = {
      // Move elements one by one from the source `s` onto the front of `d`.
      @tailrec
      def loop(s: List, d: List): List =
        if (s.isEmpty) d
        else loop(s.tail, d.prepend(s.head))
      loop(this, Nil)
    }
![Page 55: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/55.jpg)
List performance
![Page 56: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/56.jpg)
Singly-Linked List: END
![Page 57: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/57.jpg)
High Availability for writes
• Node failures can potentially result in the system having not just two but several versions of the same data.
• Updates in the presence of network partitions and node failures can potentially result in an object having distinct version sub-histories.
![Page 58: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/58.jpg)
High Availability for writes
• Dynamo uses vector clocks to capture causality between different versions of the same object.
• One vector clock is associated with every version of every object.
• We can determine whether two versions of an object are on parallel branches or have a causal ordering by examining their vector clocks.
![Page 59: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/59.jpg)
High Availability for writes
• When dealing with different copies of the same object:
  – If every counter in the first object’s clock is less than or equal to the corresponding counter in the second object’s clock, then the first is an ancestor of the second and can be forgotten.
  – Otherwise, the two changes are considered to be in conflict and require reconciliation.
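The comparison rule above can be written down in a few lines. The sketch below models a clock as a map from node id to counter; the function names are invented for this example, and the node names Sx, Sy and Sz follow the version-evolution example in the Dynamo paper.

```scala
object VectorClocks {
  /** Node id -> counter; a missing entry counts as 0. */
  type Clock = Map[String, Long]

  /** Increment the counter of the node that handles a write. */
  def increment(c: Clock, node: String): Clock =
    c.updated(node, c.getOrElse(node, 0L) + 1)

  /** True if every counter in `a` is <= the corresponding counter in `b`. */
  def isAncestorOrEqual(a: Clock, b: Clock): Boolean =
    a.forall { case (node, n) => n <= b.getOrElse(node, 0L) }

  /** Neither clock is an ancestor of the other: the versions conflict. */
  def concurrent(a: Clock, b: Clock): Boolean =
    !isAncestorOrEqual(a, b) && !isAncestorOrEqual(b, a)

  /** Clock for a reconciled version: element-wise maximum of both clocks. */
  def merge(a: Clock, b: Clock): Clock =
    (a.keySet ++ b.keySet)
      .map(k => k -> math.max(a.getOrElse(k, 0L), b.getOrElse(k, 0L)))
      .toMap

  def main(args: Array[String]): Unit = {
    val d1 = increment(Map.empty, "Sx")     // [Sx:1]
    val d2 = increment(d1, "Sx")            // [Sx:2]
    val d3 = increment(d2, "Sy")            // [Sx:2, Sy:1]
    val d4 = increment(d2, "Sz")            // [Sx:2, Sz:1]
    println(isAncestorOrEqual(d1, d2))      // true: d1 can be forgotten
    println(concurrent(d3, d4))             // true: needs reconciliation
    val d5 = increment(merge(d3, d4), "Sx") // [Sx:3, Sy:1, Sz:1]
    println(isAncestorOrEqual(d3, d5) && isAncestorOrEqual(d4, d5)) // true
  }
}
```

How the conflicting values themselves are merged (for example, taking the union of two shopping carts) is left to the application; the clocks only say that a merge is needed.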
![Page 60: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/60.jpg)
HA with Vector Clocks
• A vector clock is an algorithm for generating a partial ordering of events in a distributed system and detecting causality violations.
• Vector clocks are based on logical timestamps, otherwise known as Lamport clocks.
• A Lamport clock is a single integer value that is passed around the cluster with every message sent between nodes.
![Page 61: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/61.jpg)
HA with Vector Clocks
• Events in the blue region are the causes leading to event B4, whereas those in the red region are the effects of event B4.
![Page 62: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/62.jpg)
HA with Vector Clocks
• Each node keeps a record of what it thinks the latest (i.e. highest) Lamport clock value is, and if it hears a larger value from some other node, it updates its own value.
• Every time a database record is produced, the producing node can attach the current Lamport clock value + 1 to it as a timestamp.
• This sets up an ordering on all records (a total ordering once ties between equal timestamps are broken, for example by node id) with the valuable property that if record A may causally precede record B, then A's timestamp < B's timestamp.
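The two rules on this slide (adopt any larger value you hear, and stamp new records with the current value plus one) fit in a very small class. The class and method names below are assumptions made for the illustration.

```scala
object LamportClocks {
  /** Per-node logical clock holding the highest value the node knows of. */
  final class LamportClock {
    private var latest = 0L

    /** Local event (e.g. producing a record): advance and return the timestamp. */
    def tick(): Long = {
      latest += 1
      latest
    }

    /** On receiving a message, adopt the sender's value if it is larger. */
    def observe(remote: Long): Unit = {
      latest = math.max(latest, remote)
    }
  }

  def main(args: Array[String]): Unit = {
    val a = new LamportClock
    val b = new LamportClock
    val t1 = a.tick() // record written on node A
    val t2 = a.tick() // second record on node A
    b.observe(t2)     // a message from A to B carries A's clock value
    val t3 = b.tick() // record on B that may depend on A's records
    println((t1, t2, t3)) // (1,2,3)
    println(t2 < t3)      // true
  }
}
```

Without the `observe` call, B's first record would also get timestamp 1, even though it might depend on A's records; that is exactly the causality information the message exchange preserves.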
![Page 63: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/63.jpg)
Example Vector Clock: Dynamo
![Page 64: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/64.jpg)
Execution of get() and put()
• Each read and write is handled by a coordinator.
• Typically, this is the first among the top N nodes in the preference list.
• Read and write operations involve the first N healthy nodes in the preference list, skipping over those that are down or inaccessible.
![Page 65: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/65.jpg)
Handling temporary failures
• To handle this kind of failure, Dynamo uses a “sloppy quorum”.
• When there is a failure, a write is persisted on the next available nodes in the preference list.
• For example, if node A is down, a replica intended for A is sent to node D instead; that replica will have a hint in its metadata that records which node was the intended recipient (in this case A).
![Page 66: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/66.jpg)
Handling temporary failures
• Nodes that receive hinted replicas will keep them in a separate local database that is scanned periodically.
• Upon detecting that A has recovered, D will attempt to deliver the replica to A.
• Once the transfer succeeds, D may delete the object from its local store without decreasing the total number of replicas in the system.
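The hinted handoff flow for nodes A and D can be sketched as below. This is a toy model under stated assumptions: nodes are plain in-memory objects, "up" is a flag, and the names `Node`, `write` and `deliverHints` are invented here; quorum counting and networking are left out entirely.

```scala
import scala.collection.mutable

object HintedHandoff {
  final class Node(val name: String) {
    var up = true
    val store = mutable.Map.empty[String, String]
    // Hinted replicas are kept apart from the main store:
    // (intended node name, key, value).
    val hints = mutable.ListBuffer.empty[(String, String, String)]
  }

  /** Write to each intended replica; if one is down, leave the value plus a hint on a fallback. */
  def write(key: String, value: String, intended: Seq[Node], fallbacks: Seq[Node]): Unit = {
    val spare = fallbacks.iterator.filter(_.up)
    intended.foreach { n =>
      if (n.up) n.store(key) = value
      else if (spare.hasNext) spare.next().hints += ((n.name, key, value))
    }
  }

  /** Periodic scan: deliver hints whose intended node has recovered, then drop them. */
  def deliverHints(holder: Node, nodes: Map[String, Node]): Unit = {
    val (deliverable, pending) =
      holder.hints.toList.partition { case (target, _, _) => nodes(target).up }
    deliverable.foreach { case (target, k, v) => nodes(target).store(k) = v }
    holder.hints.clear()
    holder.hints ++= pending
  }

  def main(args: Array[String]): Unit = {
    val a = new Node("A"); val b = new Node("B")
    val c = new Node("C"); val d = new Node("D")
    val nodes = Seq(a, b, c, d).map(n => n.name -> n).toMap

    a.up = false // A is temporarily unreachable
    write("cart:42", "pizza", intended = Seq(a, b, c), fallbacks = Seq(d))
    println(d.hints.toList) // List((A,cart:42,pizza))

    a.up = true // A recovers; D's periodic scan hands the replica back
    deliverHints(d, nodes)
    println(a.store.get("cart:42")) // Some(pizza)
    println(d.hints.isEmpty)        // true
  }
}
```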
![Page 67: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/67.jpg)
Recovering from permanent failures
• There are scenarios in which hinted replicas become unavailable before they can be returned to the original replica node.
• To handle this and other threats to durability, Dynamo implements an anti-entropy protocol to keep the replicas synchronized.
![Page 68: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/68.jpg)
Recovering from permanent failures
• To detect inconsistencies between replicas faster and to minimize the amount of transferred data, Dynamo uses Merkle trees [Merkle 1988].
• A Merkle tree is a hash tree where:
  – Leaves are hashes of the values of individual keys.
  – Parent nodes higher in the tree are hashes of their respective children.
• The principal advantage of a Merkle tree is that each branch of the tree can be checked independently without requiring nodes to download the entire tree.
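A minimal sketch of the idea: build a hash tree over the key range of each replica and descend only into branches whose hashes differ. The tree shape (a balanced binary split of a key-sorted sequence), the use of MD5 and all names are assumptions made for this example, and both replicas are assumed to hold the same set of keys.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest

object MerkleTrees {
  sealed trait Tree { def hash: String }
  final case class Leaf(key: String, hash: String) extends Tree
  final case class Branch(left: Tree, right: Tree, hash: String) extends Tree

  /** Hex-encoded MD5 of a string. */
  def md5(s: String): String =
    MessageDigest.getInstance("MD5").digest(s.getBytes(UTF_8)).map("%02x".format(_)).mkString

  /** Build a tree over a non-empty, key-sorted sequence of (key, value) pairs. */
  def build(entries: Vector[(String, String)]): Tree =
    if (entries.size == 1) Leaf(entries.head._1, md5(entries.head._2))
    else {
      val (l, r) = entries.splitAt(entries.size / 2)
      val (lt, rt) = (build(l), build(r))
      // A parent's hash is the hash of its children's hashes.
      Branch(lt, rt, md5(lt.hash + rt.hash))
    }

  /** Keys whose values differ, assuming both trees cover the same key set. */
  def diff(a: Tree, b: Tree): List[String] = (a, b) match {
    case _ if a.hash == b.hash => Nil // identical branch: skip it entirely
    case (Branch(al, ar, _), Branch(bl, br, _)) => diff(al, bl) ++ diff(ar, br)
    case (Leaf(k, _), Leaf(_, _)) => List(k)
    case _ => sys.error("trees have different shapes")
  }

  def main(args: Array[String]): Unit = {
    val replicaA = Vector("k1" -> "a", "k2" -> "b", "k3" -> "c", "k4" -> "d")
    val replicaB = replicaA.updated(2, "k3" -> "stale")
    println(diff(build(replicaA), build(replicaB))) // List(k3)
    println(diff(build(replicaA), build(replicaA))) // List()
  }
}
```

When the root hashes match, nothing else needs to be exchanged; otherwise only the hashes along the differing branches are compared.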
![Page 69: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/69.jpg)
Membership and failure detection
• Membership changes are made explicitly, in response to permanent node failures or manual errors.
• In such cases, an administrator uses a command-line tool or a browser
  – to connect to a Dynamo node and issue a membership change
  – to join a node to a ring, or
  – to remove a node from a ring.
![Page 70: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/70.jpg)
Implementation
• In Dynamo, each storage node has three main software components:
  1. request coordination,
  2. membership and failure detection,
  3. and a local persistence engine.
• All these components are implemented in Java.
![Page 71: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/71.jpg)
Backend Storage
• Dynamo’s local persistence component allows different storage engines to be plugged in.
• Engines that are in use are:
  1. Berkeley Database (BDB) Transactional Data Store,
  2. Berkeley Database Java Edition,
  3. MySQL,
  4. and an in-memory buffer with persistent backing store.
![Page 72: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/72.jpg)
Conclusions
![Page 73: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/73.jpg)
Dynamo Main Contributions
1. It demonstrates how different techniques can be combined to provide a single highly available system.
2. It demonstrates that an eventually consistent storage system can be used in production with demanding applications.
3. It provides insight into the tuning of these techniques.
![Page 74: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/74.jpg)
References
1. http://diyhpl.us/~bryan/papers2/distributed/distributed-systems/consistent-hashing.1996.pdf
2. http://www.ist-selfman.org/wiki/images/9/9f/2006-schuett-gp2pc.pdf
3. http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/
4. http://www.tom-e-white.com/2007/11/consistent-hashing.html
5. http://michaelnielsen.org/blog/consistent-hashing/
6. http://research.microsoft.com/pubs/66979/tr-2003-60.pdf
7. http://www.quora.com/Why-use-Vector-Clocks-in-a-distributed-database
![Page 75: 7. Key-Value Databases: In Depth](https://reader030.fdocuments.us/reader030/viewer/2022020218/55a982a51a28ab60458b4712/html5/thumbnails/75.jpg)
References (continued)
8. http://basho.com/why-vector-clocks-are-easy/
9. http://en.wikipedia.org/wiki/Vector_clock
10. http://basho.com/why-vector-clocks-are-hard/
11. http://www.datastax.com/dev/blog/why-cassandra-doesnt-need-vector-clocks
12. https://github.com/patriknw/akka-data-replication