Amazon Dynamo


Page 1: Amazon Dynamo

Amazon Dynamo

Ali S. Bilal ([email protected])

University of Tehran

Fall 2014

Page 2: Amazon Dynamo

What is this presentation all about?

We present the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.


Page 3: Amazon Dynamo


Amazon runs a world-wide e-commerce platform that serves tens of millions of customers at peak times using tens of thousands of servers located in many data centers around the world.

There are strict operational requirements on Amazon’s platform in terms of performance, reliability and efficiency, and to support continuous growth the platform needs to be highly scalable.


Page 4: Amazon Dynamo


Reliability is one of the most important requirements because even the slightest outage has significant financial consequences and impacts customer trust.

Amazon uses a highly decentralized, loosely coupled, service-oriented architecture consisting of hundreds of services.


Page 5: Amazon Dynamo


In this environment there is a particular need for storage technologies that are always available. For example, customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados. Therefore, the service responsible for managing shopping carts requires that it can always write to and read from its data store, and that its data needs to be available across multiple data centers.


Page 6: Amazon Dynamo


Dealing with failures in an infrastructure comprised of millions of components is our standard mode of operation; there are always a small but significant number of server and network components that are failing at any given time. As such, Amazon’s software systems need to be constructed in a manner that treats failure handling as the normal case without impacting availability or performance.


Page 7: Amazon Dynamo


There are many services on Amazon’s platform that only need primary-key access to a data store. For many services, such as those that provide best seller lists, shopping carts, customer preferences, session management, sales rank, and product catalog, the common pattern of using a relational database would lead to inefficiencies and limit scale and availability.


Page 8: Amazon Dynamo


Dynamo targets apps with weaker consistency requirements, under these assumptions:

~ a non-hostile environment
~ commodity hardware
~ simple read/write operations to a data item uniquely identified by a key


Page 9: Amazon Dynamo


Dynamo is an always-on key-value storage system:

~ write: always succeeds (even in case of failures or a network partition)
~ read: returns whatever can be found (and may let apps deal with inconsistency)
~ eventual consistency. Why? +performance +availability


Page 10: Amazon Dynamo


Who resolves inconsistency?

~ the data store: can apply a simple policy like "last write wins"
~ the app: may have a better understanding of how to resolve

More design principles:

~ incremental scalability
~ symmetry: no node takes a special role



Page 11: Amazon Dynamo


~ always writable
~ security is not a concern (because of single-domain administration)
~ flat namespace (key-value), not relational
~ the ultimate goal is fast response, for both reads and writes



Page 12: Amazon Dynamo

Service Level Agreements (SLA)

A negotiated contract where a client and a service agree on several system-related characteristics, which most prominently include the client’s expected request rate distribution for a particular API and the expected service latency under those conditions.

An example of a simple SLA is a service guaranteeing that it will provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second.
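To make that percentile target concrete, here is a minimal sketch (hypothetical helper, not from the paper) that checks a batch of measured latencies against such an SLA:

```java
import java.util.Arrays;

// Minimal sketch (not from the paper): check measured request latencies
// against an SLA of "percentile% of requests under boundMs milliseconds".
public class SlaCheck {
    static boolean meetsSla(long[] latenciesMs, long boundMs, double percentile) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        // Index of the percentile-th value (e.g., 99.9th) in the sorted sample.
        int idx = (int) Math.ceil(percentile / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)] <= boundMs;
    }

    public static void main(String[] args) {
        long[] sample = {12, 25, 40, 55, 290, 310}; // made-up latencies
        System.out.println(meetsSla(sample, 300, 99.9)); // false: p99.9 is 310 ms
    }
}
```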


Page 13: Amazon Dynamo

Service Oriented Architecture


Page 14: Amazon Dynamo

System Interface


Dynamo stores objects associated with a key through a simple interface; it exposes two operations: get() and put().

The get(key) operation locates the object replicas associated with the key in the storage system and returns a single object or a list of objects with conflicting versions along with a context.

The put(key, context, object) operation determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk.
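As a rough illustration, this two-operation interface might look like the following to a caller; all names here are assumptions for the sketch, not Dynamo’s actual client library:

```java
import java.util.List;

// Sketch of the two-operation interface described above; all names are
// assumptions for illustration, not Dynamo's real client library.
interface DynamoStore {
    // Returns one object, or several causally-conflicting versions,
    // together with an opaque context (carries the vector clock).
    GetResult get(byte[] key);

    // The context from a preceding get() must be passed back so the
    // store can position the new version in the version history.
    void put(byte[] key, Context context, byte[] object);
}

class GetResult {
    List<byte[]> versions;  // one element in the common case
    Context context;        // opaque to the caller
}

class Context { /* wraps the vector clock; opaque to callers */ }
```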

Page 15: Amazon Dynamo

System Interface


Dynamo treats both the key and the object supplied by the caller as an opaque array of bytes. It applies an MD5 hash to the key to generate a 128-bit identifier, which is used to determine the storage nodes that are responsible for serving the key.
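Generating the 128-bit identifier can be sketched with Java’s standard MessageDigest API; the surrounding method name is illustrative:

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class KeyHash {
    // MD5 yields 16 bytes = a 128-bit identifier, as described above.
    static BigInteger positionOnRing(byte[] key) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(key);
        return new BigInteger(1, digest); // non-negative 128-bit value
    }
}
```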

Page 16: Amazon Dynamo

Partitioning (consistent hashing)


~ to distribute load across multiple storage hosts, each node is assigned a random value (its position on the ring)
~ a data item's key is hashed to yield its position on the ring; the ring is then walked clockwise to find the first node with a position larger than the item's position
~ each node is responsible for the region of the ring between it and its predecessor node
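A minimal sketch of such a ring, assuming a sorted map of token positions: the clockwise walk becomes a ceiling lookup that wraps around (the weak string hash is a stand-in for MD5, and virtual nodes are included by registering several positions per node):

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative ring: positions are hashes of node names (or of
// "name-i" for virtual nodes); lookup walks clockwise via ceilingEntry.
public class Ring {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(String node, int virtualNodes) {
        for (int i = 0; i < virtualNodes; i++)
            ring.put(hash(node + "-" + i), node);
    }

    String nodeFor(long keyPosition) {
        // First node at or after the key's position; wrap to the start
        // of the ring if the key falls past the last token.
        Map.Entry<Long, String> e = ring.ceilingEntry(keyPosition);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    private long hash(String s) {
        return s.hashCode() & 0xffffffffL; // stand-in for MD5 in this sketch
    }
}
```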

Page 17: Amazon Dynamo

Partitioning (consistent hashing)


Page 18: Amazon Dynamo

Partitioning (consistent hashing)


~ problem: non-uniform data and load distribution, because each node is assigned a random position
~ problem: the basic algorithm is oblivious to heterogeneity in node performance
~ solution: use virtual nodes, i.e., each node is assigned multiple positions on the ring
~ if a physical node becomes unavailable, the load handled by this node is evenly dispersed across the remaining available nodes
~ virtual nodes also deal with heterogeneity: the number of virtual nodes per host can reflect its capacity

Page 19: Amazon Dynamo



Replication

~ for high availability and durability
~ a data item is stored at a node and replicated to its next N successors on the ring
~ with virtual nodes, the first N successor positions for a particular key may be owned by fewer than N distinct physical nodes; hence, when building this list, positions are skipped to make sure the list contains only N distinct physical nodes
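A sketch of building such a preference list, assuming the ring representation from the earlier sketch: walk clockwise from the key’s position and keep only the first N distinct physical nodes:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Sketch: walk the ring clockwise from the key's position and collect
// the first N distinct physical nodes, skipping repeated owners of
// virtual-node positions. Ring layout as in the earlier sketch.
public class PreferenceList {
    static List<String> build(TreeMap<Long, String> ring, long keyPos, int n) {
        Set<String> distinct = new LinkedHashSet<>();
        // Start at the first token clockwise from the key, then wrap.
        for (Map.Entry<Long, String> e : ring.tailMap(keyPos).entrySet()) {
            if (distinct.size() == n) break;
            distinct.add(e.getValue());
        }
        for (Map.Entry<Long, String> e : ring.headMap(keyPos).entrySet()) {
            if (distinct.size() == n) break;
            distinct.add(e.getValue());
        }
        return new ArrayList<>(distinct);
    }
}
```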


Page 20: Amazon Dynamo

Data Versioning


~ needed for high write availability
~ a write can return before the update propagates to all replicas; hence, eventual consistency
~ but a subsequent get() may return data that is not the latest
~ the result of each modification is treated as a new and immutable version of the data
~ multiple versions of an object are allowed in the system at once, to be resolved later

Page 21: Amazon Dynamo

Data Versioning


Most of the time, resolution is simple: the new version subsumes the previous ones. In some scenarios we end up with conflicting versions; in this case the client performs semantic reconciliation, collapsing multiple branches of data evolution back into one, for example by "merging" different versions of a customer's shopping cart.

Page 22: Amazon Dynamo

Resolve Inconsistency


~ use vector clocks (a list of <node, counter> pairs) to capture causality between versions
~ system level: if V1 <= V2, then V2 is newer and V1 can be discarded
~ otherwise, V1 and V2 are parallel branches and need to be resolved at the app level (e.g., merged) when read
~ on a read, all causally unrelated versions of the data are presented to the client for semantic resolution
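A minimal sketch of this comparison (an illustrative representation, not Dynamo’s internal one):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the causality check described above: V1 "descends from" V2
// if every counter in V2 is <= the matching counter in V1. If neither
// descends from the other, the versions are parallel branches.
public class VectorClock {
    final Map<String, Long> counters = new HashMap<>();

    boolean descendsFrom(VectorClock other) {
        for (Map.Entry<String, Long> e : other.counters.entrySet()) {
            long mine = counters.getOrDefault(e.getKey(), 0L);
            if (mine < e.getValue()) return false;
        }
        return true;
    }

    static boolean inConflict(VectorClock a, VectorClock b) {
        // Parallel branches: neither subsumes the other; the app must merge.
        return !a.descendsFrom(b) && !b.descendsFrom(a);
    }
}
```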

Page 23: Amazon Dynamo

Resolve Inconsistency


Page 24: Amazon Dynamo

Resolve Inconsistency


~ the size of the vector clock can grow too large
~ fix: store a timestamp with each <node, counter> pair, indicating the last time that node updated the data
~ when the clock needs to be truncated, remove the oldest pair
~ this is just a quick fix: it may lose causal relationships and make reconciliation harder, but they claimed it is OK and that this has not been a problem in production


Page 25: Amazon Dynamo

Execution of put and get: 1st way


~ directly contact a storage node: the Dynamo library links with the client code
~ lower latency, because a potential forwarding step is skipped, but the membership information needs to be updated periodically (default: every 10 seconds)

Page 26: Amazon Dynamo

Execution of put and get: 2nd way


~ go through a load balancer, which in turn picks a node
~ the client does not have to link with Dynamo code, but may see higher latency
~ any storage node can receive get and put operations for any key; if that node is not in the key's preference list, it forwards the request to a node that is

Page 27: Amazon Dynamo

Sloppy Quorum Consistency Protocol


~ read/write operations are performed on the first N healthy nodes from the preference list
~ R: the number of nodes that must participate in a successful read operation
~ W: the number of nodes that must participate in a successful write operation
~ condition: R + W > N, so that the set of nodes read and the set of nodes written always overlap in at least one node

Page 28: Amazon Dynamo

Sloppy Quorum Consistency Protocol


What happens on put()? Upon receiving a put() request, the coordinator:
~ generates the vector clock for the new version
~ writes the new version locally
~ sends the new version (along with the vector clock) to the N highest-ranked nodes
~ the write succeeds if at least W-1 of those nodes respond


Page 29: Amazon Dynamo

Sloppy Quorum Consistency Protocol


What happens on get()? The coordinator receives the get() request, then:
~ requests all existing versions of the data for that key from the N highest-ranked reachable nodes in the key's preference list
~ waits for R responses before returning the result to the client
~ if there are multiple versions of the data, it returns all versions that it considers causally unrelated
~ the client code reconciles divergent versions, and the reconciled version is written back
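Both paths can be condensed into one illustrative coordinator sketch; it is synchronous for brevity (the real system is event-driven), and all type names are assumptions:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative quorum coordinator: succeeds once W (writes) or R (reads)
// of the N preference-list replicas respond. Replica is a stand-in type.
public class Coordinator {
    interface Replica {
        boolean write(byte[] key, Object version);  // true on ack
        List<Object> read(byte[] key);              // null if unreachable
    }

    static boolean put(List<Replica> prefList, byte[] key, Object version, int w) {
        int acks = 0;
        for (Replica rep : prefList) {
            if (rep.write(key, version) && ++acks >= w) return true;
        }
        return false; // fewer than W replicas acknowledged
    }

    static List<Object> get(List<Replica> prefList, byte[] key, int r) {
        List<Object> versions = new ArrayList<>();
        int responses = 0;
        for (Replica rep : prefList) {
            List<Object> v = rep.read(key);
            if (v == null) continue;
            versions.addAll(v);          // may contain divergent versions
            if (++responses >= r) break; // R responses are enough
        }
        return responses >= r ? versions : null;
    }
}
```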

Page 30: Amazon Dynamo

Handling Temporary Failure


~ use hinted handoff: a replica that cannot reach its intended destination is written to another node along with a hint recording the temporarily unavailable destination; this deals with temporary partitions or crashes
~ later on, when the real destination comes back, the hint is used to copy the replica to that destination
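A sketch of the mechanism, assuming hinted replicas are kept in a local map keyed by their intended destination and delivered by a periodic task (all names illustrative):

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;

// Sketch of hinted handoff: writes that could not reach their real
// destination are stored locally, keyed by that destination, and
// delivered once it becomes reachable again.
public class HintedHandoff {
    record Hint(byte[] key, Object version) {}

    private final ConcurrentHashMap<String, List<Hint>> hints = new ConcurrentHashMap<>();

    void storeHint(String intendedNode, byte[] key, Object version) {
        hints.computeIfAbsent(intendedNode, n -> new CopyOnWriteArrayList<>())
             .add(new Hint(key, version));
    }

    // Called periodically: once a node is reachable again, hand its
    // replicas back and drop the local copies.
    void deliverTo(String node, BiConsumer<byte[], Object> send) {
        List<Hint> pending = hints.remove(node);
        if (pending == null) return;
        for (Hint h : pending) send.accept(h.key(), h.version());
    }
}
```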

Page 31: Amazon Dynamo

Handling Temporary Failure


Page 32: Amazon Dynamo

Handling Temporary Failure


How about handling data center failures?
~ solution: place replicas across data centers, by constructing the preference list such that its storage nodes are spread across multiple data centers

What is bad about hinted handoff?
~ it is not so good in case of permanent failure
~ in the example above, what if D becomes unavailable before it can return the hinted replicas to A?

Page 33: Amazon Dynamo

Handling Permanent Failure


~ hinted handoff is not effective in case of permanent failure
~ solution: anti-entropy using Merkle trees
~ lets replicas confirm that data that hasn't changed still matches, without shipping it, minimizing the amount of transferred data
~ each node maintains a Merkle tree for each key range it hosts
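A minimal sketch of such a tree (illustrative, not Dynamo’s implementation): leaves hash a key range’s data, inner nodes hash their children, and equal hashes let whole subtrees be skipped during comparison:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Illustrative Merkle node: a leaf hashes one key range's data, an inner
// node hashes its children's hashes. If two replicas' root hashes match,
// the whole range matches; otherwise recurse only into differing children.
public class MerkleNode {
    byte[] hash;
    MerkleNode left, right;

    static MerkleNode leaf(byte[] data) throws NoSuchAlgorithmException {
        MerkleNode n = new MerkleNode();
        n.hash = MessageDigest.getInstance("MD5").digest(data);
        return n;
    }

    static MerkleNode inner(MerkleNode l, MerkleNode r) throws NoSuchAlgorithmException {
        MerkleNode n = new MerkleNode();
        n.left = l; n.right = r;
        MessageDigest md = MessageDigest.getInstance("MD5");
        md.update(l.hash);
        md.update(r.hash);
        n.hash = md.digest();
        return n;
    }

    static boolean sameRange(MerkleNode a, MerkleNode b) {
        return Arrays.equals(a.hash, b.hash); // equal hash => skip the subtree
    }
}
```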

Page 34: Amazon Dynamo

Handling Permanent Failure


Page 35: Amazon Dynamo

Membership and Failure detection


~ an admin makes membership-change requests explicitly
~ a gossip-based protocol is used to learn about other nodes' key ranges and membership
~ at every interval, each node contacts a random node and the two exchange membership information
~ not only membership info: partition and placement info is also exchanged, so a node can route a request directly to a responsible node
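The periodic exchange can be sketched as merging version-stamped membership entries, each side keeping the newer entry per node (names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the gossip merge: each membership entry carries a version;
// on contact, both sides keep the newer entry for every node. Partition
// and placement info would piggyback on the same exchange.
public class Gossip {
    record Entry(String status, long version) {}

    final Map<String, Entry> view = new ConcurrentHashMap<>();

    void mergeFrom(Map<String, Entry> peerView) {
        for (Map.Entry<String, Entry> e : peerView.entrySet()) {
            view.merge(e.getKey(), e.getValue(),
                (mine, theirs) -> theirs.version() > mine.version() ? theirs : mine);
        }
    }
}
```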

Page 36: Amazon Dynamo

External discovery


An admin contacts node A and joins node A to the ring, then contacts node B and joins node B to the ring. Nodes A and B would each consider itself a member of the ring, yet neither would be immediately aware of the other: a logical partition.

Seeds are used to prevent such logical partitions; seeds are nodes known to all nodes, discovered by an external mechanism.

Page 37: Amazon Dynamo

Failure Detection: using timeouts

~ if B does not respond to A's messages, then A considers B failed (even if B responds to C's messages)
~ if clients issue many requests, A can detect B's failure quickly
~ if there are no client requests, there is no need to detect the failure at all (it is not necessary)

Page 38: Amazon Dynamo

Adding/Removing Storage Nodes


When we add a new node X to the ring, some nodes no longer have to store some of their keys, and these nodes transfer those keys to X.

By adding a confirmation round between the source and the destination, it is made sure that the destination node does not receive any duplicate transfers for a given key range.

Page 39: Amazon Dynamo

Adding/Removing Storage Nodes


Page 40: Amazon Dynamo


In Dynamo, each storage node has three main software components:

~ request coordination
~ membership and failure detection
~ local persistence engine

All these components are implemented in Java.



Page 41: Amazon Dynamo


Dynamo’s local persistence component allows for different storage engines to be plugged in:

~ Berkeley Database (BDB) Transactional Data Store
~ BDB Java Edition
~ MySQL
~ an in-memory buffer with a persistent backing store


Page 42: Amazon Dynamo


The request coordination component is built on top of an event-driven messaging substrate where the message processing pipeline is split into multiple stages, similar to the SEDA architecture. All communication is implemented using Java NIO channels.

The state machine contains all the logic for identifying the nodes responsible for a key, sending the requests, waiting for responses, potentially doing retries, processing the replies, and packaging the response to the client.


Page 43: Amazon Dynamo


For instance, a read operation implements the following state machine: (i) send read requests to the nodes, (ii) wait for minimum number of required responses, (iii) if too few replies were received within a given time bound, fail the request, (iv) otherwise gather all the data versions and determine the ones to be returned and (v) if versioning is enabled, perform syntactic reconciliation and generate an opaque write context that contains the vector clock that subsumes all the remaining versions
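A compressed sketch of those transitions; the real component is event-driven (SEDA-style), so this sequential version only illustrates steps (i)-(v):

```java
// Compressed sketch of the read state machine described above. The real
// component is event-driven (SEDA-style); this version only shows the
// state transitions (i)-(v).
public class ReadStateMachine {
    enum State { SEND_REQUESTS, WAIT_MIN_RESPONSES, FAIL, GATHER_VERSIONS, RECONCILE, DONE }

    static State next(State s, int responses, int required, boolean timedOut,
                      boolean versioningEnabled) {
        switch (s) {
            case SEND_REQUESTS:      return State.WAIT_MIN_RESPONSES;       // (i)
            case WAIT_MIN_RESPONSES:                                        // (ii)/(iii)
                if (timedOut && responses < required) return State.FAIL;
                return responses >= required ? State.GATHER_VERSIONS : s;
            case GATHER_VERSIONS:                                           // (iv)
                return versioningEnabled ? State.RECONCILE : State.DONE;
            case RECONCILE:          return State.DONE;                     // (v)
            default:                 return s;
        }
    }
}
```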


Page 44: Amazon Dynamo


After the read response has been returned to the caller, the state machine waits for a small period of time to receive any outstanding responses. If stale versions were returned in any of the responses, the coordinator updates those nodes with the latest version. This process is called read repair because it repairs replicas that have missed a recent update at an opportunistic time, and relieves the anti-entropy protocol from having to do it.


Page 45: Amazon Dynamo


Write requests are coordinated by one of the top N nodes in the preference list. Because the request load is not uniformly distributed across objects, this approach has led to uneven load distribution, resulting in SLA violations.

To counter this, the coordinator for a write is chosen to be the node that replied fastest to the previous read operation (each write usually follows a read), which also increases the chances of getting "read-your-writes" consistency.


Page 46: Amazon Dynamo

Experiences and Lessons Learned

We can use Dynamo with different configurations:

Business-logic-specific reconciliation:
~ the client application performs its own reconciliation logic
~ for example: the shopping cart service (just merge everything)


Page 47: Amazon Dynamo

Experiences and Lessons Learned

Timestamp-based reconciliation:
~ "last write wins": the object with the largest physical timestamp value is chosen as the correct version
~ for example: the service that maintains customer session information


Page 48: Amazon Dynamo

Experiences and Lessons Learned

High-performance read engine:
~ some services are read-mostly (requiring high read performance) and rarely write
~ R = 1 and W = N
~ for example: services that maintain the product catalog and promotional items


Page 49: Amazon Dynamo

Experiences and Lessons Learned

Client applications can tune the values of N, R, and W to achieve their desired levels of performance, availability and durability. The value of N determines the durability of each object. If W is set to 1, the system will never reject a write request as long as there is at least one node in the system that can successfully process it.


Page 50: Amazon Dynamo

Balancing performance and Durability

~ the performance of a read or write operation is limited by the slowest of the R or W replicas
~ durability can be traded for performance:
~ each storage node maintains an object buffer in main memory
~ a write request is stored in the buffer and periodically written to disk by a writer thread
~ reads check the buffer first


Page 51: Amazon Dynamo

Balancing performance and Durability

To reduce the durability risk, the coordinator chooses one of the N replicas to perform a "durable write". Since the coordinator waits for only W responses, most of the time performance is not affected by the slower "durable write".


Page 52: Amazon Dynamo

Balancing performance and Durability


Page 53: Amazon Dynamo

Balancing performance and Durability


Page 54: Amazon Dynamo

Ensuring Uniform Load Distribution

~ under high loads, a large number of popular keys are accessed, and due to the uniform distribution of keys the load is evenly distributed
~ during low loads, fewer popular keys are accessed, resulting in a higher load imbalance



Page 55: Amazon Dynamo

Ensuring Uniform Load Distribution


Page 56: Amazon Dynamo

Ensuring Uniform Load Distribution

Various partitioning schemes and their implications on load distribution:

Strategy 1: T random tokens per node, and partition by token value
~ each node is assigned T tokens (virtual nodes)
~ the tokens of all nodes are ordered according to their values in the hash space
~ every two consecutive tokens define a range


Page 57: Amazon Dynamo

Ensuring Uniform Load Distribution


Page 58: Amazon Dynamo

Ensuring Uniform Load Distribution

Scanning problem: bootstrapping takes a long time:
~ when a new node joins the system, it "steals" its key ranges from other nodes
~ these other nodes need to scan their local stores to hand over the right items
~ scans are expensive and resource intensive, and have to execute in the background
~ key ranges for many nodes change when a node joins/leaves; hence the Merkle trees for these nodes must be recalculated, which is expensive too


Page 59: Amazon Dynamo

Ensuring Uniform Load Distribution

Strategy 2: T random tokens per node and equal-sized partitions
~ the hash space is divided into Q equally sized partitions
~ each node is assigned T random tokens
~ Q is set such that Q >> N and Q >> S*T, where S is the number of nodes in the system
~ tokens are only used to build the function that maps values in the hash space to the ordered list of nodes, not to decide the partitioning


Page 60: Amazon Dynamo

Ensuring Uniform Load Distribution


Page 61: Amazon Dynamo

Ensuring Uniform Load Distribution

Strategy 3: Q/S tokens per node, equal-sized partitions
~ each node is assigned Q/S tokens
~ when a node leaves the system, its tokens are randomly distributed to the remaining nodes such that these properties are preserved
~ when a node joins the system, it "steals" tokens from nodes in the system in a way that preserves these properties
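With equal-sized partitions (strategies 2 and 3), locating a key is a range computation plus a lookup in a partition-to-nodes table; a minimal sketch under those assumptions:

```java
import java.util.List;
import java.util.Map;

// Sketch for strategies 2/3: the hash space is pre-divided into Q equal
// contiguous ranges, so a key's partition is its position divided by the
// range size. A table maps each partition to its current owner nodes;
// moving a partition just edits this table (and ships one file).
public class FixedPartitions {
    static final long SPACE = 1L << 62;          // sketch: hash space [0, 2^62)
    final int q;                                 // number of partitions (Q)
    final long rangeSize;
    final Map<Integer, List<String>> owners;     // partition -> preference list

    FixedPartitions(int q, Map<Integer, List<String>> owners) {
        this.q = q;
        this.rangeSize = SPACE / q;
        this.owners = owners;
    }

    List<String> nodesFor(long keyPosition) {    // keyPosition in [0, SPACE)
        int partition = (int) (keyPosition / rangeSize);
        return owners.get(Math.min(partition, q - 1));
    }
}
```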


Page 62: Amazon Dynamo

Ensuring Uniform Load Distribution


Page 63: Amazon Dynamo

Ensuring Uniform Load Distribution

Strategy 3 seems to be the best choice because of faster bootstrapping/recovery:
~ since partition ranges are fixed, they can be stored in separate files
~ a partition can be relocated as a unit by simply transferring the file, avoiding the random accesses needed to locate specific items
~ this simplifies the process of bootstrapping and recovery


Page 64: Amazon Dynamo

Ensuring Uniform Load Distribution

Ease of archival:
~ archival is simpler because the partition files can be archived separately
~ by contrast, in Strategy 1 the tokens are chosen randomly, and archiving the data stored in Dynamo requires retrieving the keys from individual nodes separately, which is usually inefficient and slow

A disadvantage: changing the node membership requires coordination in order to preserve the properties required of the assignment.


Page 65: Amazon Dynamo

Ensuring Uniform Load Distribution

The efficiency of these three strategies was evaluated for a system with S=30 and N=3. The load-balancing efficiency of each strategy was measured for different sizes of the membership information that needs to be maintained at each node, where load-balancing efficiency is defined as the ratio of the average number of requests served by each node to the maximum number of requests served by the hottest node.


Page 66: Amazon Dynamo

Ensuring Uniform Load Distribution


Page 67: Amazon Dynamo

Divergent Versions: When and How many

In our next experiment, the number of versions returned to the shopping cart service was profiled for a period of 24 hours. During this period, 99.94% of requests saw exactly one version; 0.00057% of requests saw 2 versions; 0.00047% of requests saw 3 versions; and 0.00009% of requests saw 4 versions. This shows that divergent versions are created rarely.


Page 68: Amazon Dynamo

Divergent Versions: When and How many


When are divergent versions created?
~ failures: node failure, data center failure, network partition
~ a large number of concurrent writers to a single data item, with the writes handled by multiple nodes

Divergent versions are usually reconciled automatically by vector clocks, and if not, by the client apps.


Page 69: Amazon Dynamo

Client-driven or Server-driven Coordination

Two ways. First way: directly contact a storage node:
~ the Dynamo library links with the client code
~ lower latency, because a potential forwarding step is skipped, but the membership information needs to be updated periodically (default: every 10 seconds)
~ so the client may see stale membership for up to 10 seconds


Page 70: Amazon Dynamo

Client-driven or Server-driven Coordination

Second way: through a load balancer, which in turn picks a node:
~ the client does not have to link with Dynamo code
~ but may see higher latency


Page 71: Amazon Dynamo

Client-driven or Server-driven Coordination


Page 72: Amazon Dynamo

Balancing background vs. foreground task

Foreground tasks: get/put operations.

Background tasks:
~ data handoff (due to hinting or nodes leaving/joining)
~ replica synchronization (due to permanent failure)

Challenge: background activity must not affect foreground tasks.
Solution: monitor foreground operations and admit background tasks only when appropriate.


Page 73: Amazon Dynamo

Final Words

Many Amazon internal services have used Dynamo for the past two years and it has provided significant levels of availability to their applications. In particular, applications have received successful responses (without timing out) for 99.9995% of requests, and no data loss event has occurred to date.

Moreover, the primary advantage of Dynamo is that it provides the necessary knobs, the three parameters (N, R, W), for services to tune their instance based on their needs.


Page 74: Amazon Dynamo

Final Words

Unlike popular commercial data stores, Dynamo exposes data consistency and reconciliation logic issues to the developers.

Dynamo adopts a full membership model where each node is aware of the data hosted by its peers. Exchanging the membership table (routing table) works for hundreds of nodes but may not scale to thousands of nodes.

