Amazon Dynamo


Page 1: Amazon Dynamo

Amazon Dynamo

Ali S. Bilal ([email protected])

University of Tehran

Fall 2014

Page 2: Amazon Dynamo

What is this presentation all about?

We present the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.


Page 3: Amazon Dynamo


Amazon runs a world-wide e-commerce platform that serves tens of millions of customers at peak times using tens of thousands of servers located in many data centers around the world.

There are strict operational requirements on Amazon’s platform in terms of performance, reliability and efficiency, and to support continuous growth the platform needs to be highly scalable.


Page 4: Amazon Dynamo


Reliability is one of the most important requirements because even the slightest outage has significant financial consequences and impacts customer trust.

Amazon uses a highly decentralized, loosely coupled, service-oriented architecture consisting of hundreds of services.


Page 5: Amazon Dynamo


In this environment there is a particular need for storage technologies that are always available. For example, customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados. Therefore, the service responsible for managing shopping carts requires that it can always write to and read from its data store, and that its data needs to be available across multiple data centers.


Page 6: Amazon Dynamo


Dealing with failures in an infrastructure comprised of millions of components is our standard mode of operation; there are always a small but significant number of server and network components that are failing at any given time. As such, Amazon’s software systems need to be constructed in a manner that treats failure handling as the normal case without impacting availability or performance.


Page 7: Amazon Dynamo


There are many services on Amazon’s platform that only need primary-key access to a data store. For many services, such as those that provide best seller lists, shopping carts, customer preferences, session management, sales rank, and product catalog, the common pattern of using a relational database would lead to inefficiencies and limit scale and availability.


Page 8: Amazon Dynamo


Dynamo targets apps with weaker consistency requirements, under these assumptions:

~ a non-hostile environment
~ commodity hardware
~ simple read/write operations to a data item uniquely identified by a key


Page 9: Amazon Dynamo


Dynamo is an always-on key-value storage system:

~ write: always succeeds (even in case of failures or a network partition)
~ read: returns whatever can be found (and may let apps deal with inconsistency)
~ eventual consistency. Why? +performance +availability


Page 10: Amazon Dynamo


Who resolves inconsistency?

~ the data store: can apply a simple policy like "last write wins"
~ the app: may have a better understanding of how to resolve

More design principles:

~ incremental scalability
~ symmetry: no node takes a special role



Page 11: Amazon Dynamo


~ always writable
~ security is not a concern (because of single-domain administration)
~ flat namespace (key-value), not relational
~ the ultimate goal is fast response, for both reads and writes



Page 12: Amazon Dynamo

Service Level Agreements (SLA)

A negotiated contract where a client and a service agree on several system-related characteristics, which most prominently include the client’s expected request rate distribution for a particular API and the expected service latency under those conditions.

An example of a simple SLA is a service guaranteeing that it will provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second.
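To make that percentile target concrete, here is a minimal sketch (hypothetical helper, not from the paper) that checks a batch of measured latencies against such an SLA:

```java
import java.util.Arrays;

// Minimal sketch (not from the paper): check measured request latencies
// against an SLA of "percentile% of requests under boundMs milliseconds".
public class SlaCheck {
    static boolean meetsSla(long[] latenciesMs, long boundMs, double percentile) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        // Index of the percentile-th value (e.g., 99.9th) in the sorted sample.
        int idx = (int) Math.ceil(percentile / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)] <= boundMs;
    }

    public static void main(String[] args) {
        long[] sample = {12, 25, 40, 55, 290, 310}; // made-up latencies
        System.out.println(meetsSla(sample, 300, 99.9)); // false: p99.9 is 310 ms
    }
}
```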


Page 13: Amazon Dynamo

Service Oriented Architecture


Page 14: Amazon Dynamo

System Interface


Dynamo stores objects associated with a key through a simple interface; it exposes two operations: get() and put().

The get(key) operation locates the object replicas associated with the key in the storage system and returns a single object or a list of objects with conflicting versions along with a context.

The put(key, context, object) operation determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk.
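As a rough illustration, this two-operation interface might look like the following to a caller; all names here are assumptions for the sketch, not Dynamo’s actual client library:

```java
import java.util.List;

// Sketch of the two-operation interface described above; all names are
// assumptions for illustration, not Dynamo's real client library.
interface DynamoStore {
    // Returns one object, or several causally-conflicting versions,
    // together with an opaque context (carries the vector clock).
    GetResult get(byte[] key);

    // The context from a preceding get() must be passed back so the
    // store can position the new version in the version history.
    void put(byte[] key, Context context, byte[] object);
}

class GetResult {
    List<byte[]> versions;  // one element in the common case
    Context context;        // opaque to the caller
}

class Context { /* wraps the vector clock; opaque to callers */ }
```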

Page 15: Amazon Dynamo

System Interface


Dynamo treats both the key and the object supplied by the caller as an opaque array of bytes. It applies an MD5 hash to the key to generate a 128-bit identifier, which is used to determine the storage nodes that are responsible for serving the key.
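Generating the 128-bit identifier can be sketched with Java’s standard MessageDigest API; the surrounding method name is illustrative:

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class KeyHash {
    // MD5 yields 16 bytes = a 128-bit identifier, as described above.
    static BigInteger positionOnRing(byte[] key) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(key);
        return new BigInteger(1, digest); // non-negative 128-bit value
    }
}
```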

Page 16: Amazon Dynamo

Partitioning (consistent hashing)


~ to distribute load across multiple storage hosts, each node is assigned a random value (its position on the ring)
~ a data item's key is hashed to yield its position on the ring; the ring is then walked clockwise to find the first node with a position larger than the item's position
~ each node is responsible for the region of the ring between it and its predecessor node
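A minimal sketch of such a ring, assuming a sorted map of token positions: the clockwise walk becomes a ceiling lookup that wraps around (the weak string hash is a stand-in for MD5, and virtual nodes are included by registering several positions per node):

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative ring: positions are hashes of node names (or of
// "name-i" for virtual nodes); lookup walks clockwise via ceilingEntry.
public class Ring {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(String node, int virtualNodes) {
        for (int i = 0; i < virtualNodes; i++)
            ring.put(hash(node + "-" + i), node);
    }

    String nodeFor(long keyPosition) {
        // First node at or after the key's position; wrap to the start
        // of the ring if the key falls past the last token.
        Map.Entry<Long, String> e = ring.ceilingEntry(keyPosition);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    private long hash(String s) {
        return s.hashCode() & 0xffffffffL; // stand-in for MD5 in this sketch
    }
}
```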

Page 17: Amazon Dynamo

Partitioning (consistent hashing)


Page 18: Amazon Dynamo

Partitioning (consistent hashing)


~ problem: non-uniform data and load distribution, because each node is assigned a random position
~ problem: the basic algorithm is oblivious to heterogeneity in node performance
~ solution: use virtual nodes, i.e., each node is assigned multiple positions on the ring
~ if a physical node becomes unavailable, the load handled by this node is evenly dispersed across the remaining available nodes
~ virtual nodes also deal with heterogeneity: the number of virtual nodes per host can reflect its capacity

Page 19: Amazon Dynamo



Replication

~ for high availability and durability
~ a data item is stored at a node and replicated to its next N successors on the ring
~ with virtual nodes, the first N successor positions for a particular key may be owned by fewer than N distinct physical nodes; hence, when building this list, positions are skipped to make sure the list contains only N distinct physical nodes
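A sketch of building such a preference list, assuming the ring representation from the earlier sketch: walk clockwise from the key’s position and keep only the first N distinct physical nodes:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Sketch: walk the ring clockwise from the key's position and collect
// the first N distinct physical nodes, skipping repeated owners of
// virtual-node positions. Ring layout as in the earlier sketch.
public class PreferenceList {
    static List<String> build(TreeMap<Long, String> ring, long keyPos, int n) {
        Set<String> distinct = new LinkedHashSet<>();
        // Start at the first token clockwise from the key, then wrap.
        for (Map.Entry<Long, String> e : ring.tailMap(keyPos).entrySet()) {
            if (distinct.size() == n) break;
            distinct.add(e.getValue());
        }
        for (Map.Entry<Long, String> e : ring.headMap(keyPos).entrySet()) {
            if (distinct.size() == n) break;
            distinct.add(e.getValue());
        }
        return new ArrayList<>(distinct);
    }
}
```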


Page 20: Amazon Dynamo

Data Versioning


~ needed for high write availability
~ a write can return before the update propagates to all replicas; hence, eventual consistency
~ but a subsequent get() may return data that is not the latest
~ the result of each modification is treated as a new and immutable version of the data
~ multiple versions of an object are allowed in the system at once, to be resolved later

Page 21: Amazon Dynamo

Data Versioning


Most of the time, resolution is simple: the new version subsumes the previous ones. In some scenarios we end up with conflicting versions; in this case the client performs semantic reconciliation, collapsing multiple branches of data evolution back into one, for example by "merging" different versions of a customer's shopping cart.

Page 22: Amazon Dynamo

Resolve Inconsistency


~ use vector clocks (a list of <node, counter> pairs) to capture causality between versions
~ system level: if V1 <= V2, then V2 is newer and V1 can be discarded
~ otherwise, V1 and V2 are parallel branches and need to be resolved at the app level (e.g., merged) when read
~ on a read, all causally unrelated versions of the data are presented to the client for semantic resolution
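A minimal sketch of this comparison (an illustrative representation, not Dynamo’s internal one):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the causality check described above: V1 "descends from" V2
// if every counter in V2 is <= the matching counter in V1. If neither
// descends from the other, the versions are parallel branches.
public class VectorClock {
    final Map<String, Long> counters = new HashMap<>();

    boolean descendsFrom(VectorClock other) {
        for (Map.Entry<String, Long> e : other.counters.entrySet()) {
            long mine = counters.getOrDefault(e.getKey(), 0L);
            if (mine < e.getValue()) return false;
        }
        return true;
    }

    static boolean inConflict(VectorClock a, VectorClock b) {
        // Parallel branches: neither subsumes the other; the app must merge.
        return !a.descendsFrom(b) && !b.descendsFrom(a);
    }
}
```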

Page 23: Amazon Dynamo

Resolve Inconsistency


Page 24: Amazon Dynamo

Resolve Inconsistency


~ the size of the vector clock can grow too large
~ fix: store a timestamp with each <node, counter> pair, indicating the last time that node updated the data
~ when the clock needs to be truncated, remove the oldest pair
~ this is just a quick fix: it may lose causal relationships and make reconciliation harder, but they claimed it is OK and that this has not been a problem in production


Page 25: Amazon Dynamo

Execution of put and get: 1st way


~ directly contact a storage node: the Dynamo library links with the client code
~ lower latency, because a potential forwarding step is skipped, but the membership information needs to be updated periodically (default: every 10 seconds)

Page 26: Amazon Dynamo

Execution of put and get: 2nd way


~ go through a load balancer, which in turn picks a node
~ the client does not have to link with Dynamo code, but may see higher latency
~ any storage node can receive get and put operations for any key; if that node is not in the key's preference list, it forwards the request to a node that is

Page 27: Amazon Dynamo

Sloppy Quorum Consistency Protocol


~ read/write operations are performed on the first N healthy nodes from the preference list
~ R: the number of nodes that must participate in a successful read operation
~ W: the number of nodes that must participate in a successful write operation
~ condition: R + W > N, so that the set of nodes read and the set of nodes written always overlap in at least one node

Page 28: Amazon Dynamo

Sloppy Quorum Consistency Protocol


What happens on put()? Upon receiving a put() request, the coordinator:
~ generates the vector clock for the new version
~ writes the new version locally
~ sends the new version (along with the vector clock) to the N highest-ranked nodes
~ the write succeeds if at least W-1 of those nodes respond


Page 29: Amazon Dynamo

Sloppy Quorum Consistency Protocol


What happens on get()? The coordinator receives the get() request, then:
~ requests all existing versions of the data for that key from the N highest-ranked reachable nodes in the key's preference list
~ waits for R responses before returning the result to the client
~ if there are multiple versions of the data, it returns all versions that it considers causally unrelated
~ the client code reconciles divergent versions, and the reconciled version is written back
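Both paths can be condensed into one illustrative coordinator sketch; it is synchronous for brevity (the real system is event-driven), and all type names are assumptions:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative quorum coordinator: succeeds once W (writes) or R (reads)
// of the N preference-list replicas respond. Replica is a stand-in type.
public class Coordinator {
    interface Replica {
        boolean write(byte[] key, Object version);  // true on ack
        List<Object> read(byte[] key);              // null if unreachable
    }

    static boolean put(List<Replica> prefList, byte[] key, Object version, int w) {
        int acks = 0;
        for (Replica rep : prefList) {
            if (rep.write(key, version) && ++acks >= w) return true;
        }
        return false; // fewer than W replicas acknowledged
    }

    static List<Object> get(List<Replica> prefList, byte[] key, int r) {
        List<Object> versions = new ArrayList<>();
        int responses = 0;
        for (Replica rep : prefList) {
            List<Object> v = rep.read(key);
            if (v == null) continue;
            versions.addAll(v);          // may contain divergent versions
            if (++responses >= r) break; // R responses are enough
        }
        return responses >= r ? versions : null;
    }
}
```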

Page 30: Amazon Dynamo

Handling Temporary Failure


~ use hinted handoff: a replica that cannot reach its intended destination is written to another node along with a hint recording the temporarily unavailable destination; this deals with temporary partitions or crashes
~ later on, when the real destination comes back, the hint is used to copy the replica to that destination
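A sketch of the mechanism, assuming hinted replicas are kept in a local map keyed by their intended destination and delivered by a periodic task (all names illustrative):

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;

// Sketch of hinted handoff: writes that could not reach their real
// destination are stored locally, keyed by that destination, and
// delivered once it becomes reachable again.
public class HintedHandoff {
    record Hint(byte[] key, Object version) {}

    private final ConcurrentHashMap<String, List<Hint>> hints = new ConcurrentHashMap<>();

    void storeHint(String intendedNode, byte[] key, Object version) {
        hints.computeIfAbsent(intendedNode, n -> new CopyOnWriteArrayList<>())
             .add(new Hint(key, version));
    }

    // Called periodically: once a node is reachable again, hand its
    // replicas back and drop the local copies.
    void deliverTo(String node, BiConsumer<byte[], Object> send) {
        List<Hint> pending = hints.remove(node);
        if (pending == null) return;
        for (Hint h : pending) send.accept(h.key(), h.version());
    }
}
```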

Page 31: Amazon Dynamo

Handling Temporary Failure


Page 32: Amazon Dynamo

Handling Temporary Failure


How about handling data center failures?
~ solution: place replicas across data centers, by constructing the preference list such that its storage nodes are spread across multiple data centers

What is bad about hinted handoff?
~ it is not so good in case of permanent failure
~ in the example above, what if D becomes unavailable before it can return the hinted replicas to A?

Page 33: Amazon Dynamo

Handling Permanent Failure


~ hinted handoff is not effective in case of permanent failure
~ solution: anti-entropy using Merkle trees
~ lets replicas confirm that data that hasn't changed still matches, without shipping it, minimizing the amount of transferred data
~ each node maintains a Merkle tree for each key range it hosts
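A minimal sketch of such a tree (illustrative, not Dynamo’s implementation): leaves hash a key range’s data, inner nodes hash their children, and equal hashes let whole subtrees be skipped during comparison:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Illustrative Merkle node: a leaf hashes one key range's data, an inner
// node hashes its children's hashes. If two replicas' root hashes match,
// the whole range matches; otherwise recurse only into differing children.
public class MerkleNode {
    byte[] hash;
    MerkleNode left, right;

    static MerkleNode leaf(byte[] data) throws NoSuchAlgorithmException {
        MerkleNode n = new MerkleNode();
        n.hash = MessageDigest.getInstance("MD5").digest(data);
        return n;
    }

    static MerkleNode inner(MerkleNode l, MerkleNode r) throws NoSuchAlgorithmException {
        MerkleNode n = new MerkleNode();
        n.left = l; n.right = r;
        MessageDigest md = MessageDigest.getInstance("MD5");
        md.update(l.hash);
        md.update(r.hash);
        n.hash = md.digest();
        return n;
    }

    static boolean sameRange(MerkleNode a, MerkleNode b) {
        return Arrays.equals(a.hash, b.hash); // equal hash => skip the subtree
    }
}
```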

Page 34: Amazon Dynamo

Handling Permanent Failure


Page 35: Amazon Dynamo

Membership and Failure detection


~ an admin makes membership-change requests explicitly
~ a gossip-based protocol is used to learn about other nodes' key ranges and membership
~ at every interval, each node contacts a random node and the two exchange membership information
~ not only membership info: partition and placement info is also exchanged, so a node can route a request directly to a responsible node
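The periodic exchange can be sketched as merging version-stamped membership entries, each side keeping the newer entry per node (names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the gossip merge: each membership entry carries a version;
// on contact, both sides keep the newer entry for every node. Partition
// and placement info would piggyback on the same exchange.
public class Gossip {
    record Entry(String status, long version) {}

    final Map<String, Entry> view = new ConcurrentHashMap<>();

    void mergeFrom(Map<String, Entry> peerView) {
        for (Map.Entry<String, Entry> e : peerView.entrySet()) {
            view.merge(e.getKey(), e.getValue(),
                (mine, theirs) -> theirs.version() > mine.version() ? theirs : mine);
        }
    }
}
```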

Page 36: Amazon Dynamo

External discovery


An admin contacts node A and joins node A to the ring, then contacts node B and joins node B to the ring. Nodes A and B would each consider itself a member of the ring, yet neither would be immediately aware of the other: a logical partition.

Seeds are used to prevent such logical partitions; seeds are nodes known to all nodes, discovered by an external mechanism.

Page 37: Amazon Dynamo

Failure Detection: using timeouts

~ if B does not respond to A's messages, then A considers B failed (even if B responds to C's messages)
~ if clients issue many requests, A can detect B's failure quickly
~ if there are no client requests, there is no need to detect the failure at all (it is not necessary)

Page 38: Amazon Dynamo

Adding/Removing Storage Nodes


When we add a new node X to the ring, some nodes no longer have to store some of their keys, and these nodes transfer those keys to X.

By adding a confirmation round between the source and the destination, it is made sure that the destination node does not receive any duplicate transfers for a given key range.

Page 39: Amazon Dynamo

Adding/Removing Storage Nodes


Page 40: Amazon Dynamo


In Dynamo, each storage node has three main software components:

~ request coordination
~ membership and failure detection
~ local persistence engine

All these components are implemented in Java.



Page 41: Amazon Dynamo


Dynamo’s local persistence component allows for different storage engines to be plugged in:

~ Berkeley Database (BDB) Transactional Data Store
~ BDB Java Edition
~ MySQL
~ an in-memory buffer with a persistent backing store


Page 42: Amazon Dynamo


The request coordination component is built on top of an event-driven messaging substrate where the message processing pipeline is split into multiple stages, similar to the SEDA architecture. All communication is implemented using Java NIO channels.

The state machine contains all the logic for identifying the nodes responsible for a key, sending the requests, waiting for responses, potentially doing retries, processing the replies, and packaging the response to the client.


Page 43: Amazon Dynamo


For instance, a read operation implements the following state machine: (i) send read requests to the nodes, (ii) wait for minimum number of required responses, (iii) if too few replies were received within a given time bound, fail the request, (iv) otherwise gather all the data versions and determine the ones to be returned and (v) if versioning is enabled, perform syntactic reconciliation and generate an opaque write context that contains the vector clock that subsumes all the remaining versions
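A compressed sketch of those transitions; the real component is event-driven (SEDA-style), so this sequential version only illustrates steps (i)-(v):

```java
// Compressed sketch of the read state machine described above. The real
// component is event-driven (SEDA-style); this version only shows the
// state transitions (i)-(v).
public class ReadStateMachine {
    enum State { SEND_REQUESTS, WAIT_MIN_RESPONSES, FAIL, GATHER_VERSIONS, RECONCILE, DONE }

    static State next(State s, int responses, int required, boolean timedOut,
                      boolean versioningEnabled) {
        switch (s) {
            case SEND_REQUESTS:      return State.WAIT_MIN_RESPONSES;       // (i)
            case WAIT_MIN_RESPONSES:                                        // (ii)/(iii)
                if (timedOut && responses < required) return State.FAIL;
                return responses >= required ? State.GATHER_VERSIONS : s;
            case GATHER_VERSIONS:                                           // (iv)
                return versioningEnabled ? State.RECONCILE : State.DONE;
            case RECONCILE:          return State.DONE;                     // (v)
            default:                 return s;
        }
    }
}
```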


Page 44: Amazon Dynamo


After the read response has been returned to the caller, the state machine waits for a small period of time to receive any outstanding responses. If stale versions were returned in any of the responses, the coordinator updates those nodes with the latest version. This process is called read repair because it repairs replicas that have missed a recent update at an opportunistic time, and relieves the anti-entropy protocol from having to do it.


Page 45: Amazon Dynamo


Write requests are coordinated by one of the top N nodes in the preference list. Because the request load is not uniformly distributed across objects, this approach has led to uneven load distribution, resulting in SLA violations.

To counter this, the coordinator for a write is chosen to be the node that replied fastest to the previous read operation (each write usually follows a read), which also increases the chances of getting "read-your-writes" consistency.


Page 46: Amazon Dynamo

Experiences and Lessons Learned

We can use Dynamo with different configurations:

Business-logic-specific reconciliation:
~ the client application performs its own reconciliation logic
~ for example: the shopping cart service (just merge everything)


Page 47: Amazon Dynamo

Experiences and Lessons Learned

Timestamp-based reconciliation:
~ "last write wins": the object with the largest physical timestamp value is chosen as the correct version
~ for example: the service that maintains customer session information


Page 48: Amazon Dynamo

Experiences and Lessons Learned

High-performance read engine:
~ some services are read-mostly (requiring high read performance) and rarely write
~ R = 1 and W = N
~ for example: services that maintain the product catalog and promotional items


Page 49: Amazon Dynamo

Experiences and Lessons Learned

Client applications can tune the values of N, R, and W to achieve their desired levels of performance, availability and durability. The value of N determines the durability of each object. If W is set to 1, the system will never reject a write request as long as there is at least one node in the system that can successfully process it.


Page 50: Amazon Dynamo

Balancing performance and Durability

~ the performance of a read or write operation is limited by the slowest of the R or W replicas
~ durability can be traded for performance:
~ each storage node maintains an object buffer in main memory
~ a write request is stored in the buffer and periodically written to disk by a writer thread
~ reads check the buffer first


Page 51: Amazon Dynamo

Balancing performance and Durability

To reduce the durability risk, the coordinator chooses one of the N replicas to perform a "durable write". Since the coordinator waits for only W responses, most of the time performance is not affected by the slower "durable write".


Page 52: Amazon Dynamo

Balancing performance and Durability


Page 53: Amazon Dynamo

Balancing performance and Durability


Page 54: Amazon Dynamo

Ensuring Uniform Load Distribution

~ under high loads, a large number of popular keys are accessed, and due to the uniform distribution of keys the load is evenly distributed
~ during low loads, fewer popular keys are accessed, resulting in a higher load imbalance



Page 55: Amazon Dynamo

Ensuring Uniform Load Distribution


Page 56: Amazon Dynamo

Ensuring Uniform Load Distribution

Various partitioning schemes and their implications on load distribution:

Strategy 1: T random tokens per node, and partition by token value
~ each node is assigned T tokens (virtual nodes)
~ the tokens of all nodes are ordered according to their values in the hash space
~ every two consecutive tokens define a range


Page 57: Amazon Dynamo

Ensuring Uniform Load Distribution


Page 58: Amazon Dynamo

Ensuring Uniform Load Distribution

Scanning problem: bootstrapping takes a long time:
~ when a new node joins the system, it "steals" its key ranges from other nodes
~ these other nodes need to scan their local stores to hand over the right items
~ scans are expensive and resource intensive, and have to execute in the background
~ key ranges for many nodes change when a node joins/leaves; hence the Merkle trees for these nodes must be recalculated, which is expensive too


Page 59: Amazon Dynamo

Ensuring Uniform Load Distribution

Strategy 2: T random tokens per node and equal-sized partitions
~ the hash space is divided into Q equally sized partitions
~ each node is assigned T random tokens
~ Q is set such that Q >> N and Q >> S*T, where S is the number of nodes in the system
~ tokens are only used to build the function that maps values in the hash space to the ordered list of nodes, not to decide the partitioning


Page 60: Amazon Dynamo

Ensuring Uniform Load Distribution


Page 61: Amazon Dynamo

Ensuring Uniform Load Distribution

Strategy 3: Q/S tokens per node, equal-sized partitions
~ each node is assigned Q/S tokens
~ when a node leaves the system, its tokens are randomly distributed to the remaining nodes such that these properties are preserved
~ when a node joins the system, it "steals" tokens from nodes in the system in a way that preserves these properties
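With equal-sized partitions (strategies 2 and 3), locating a key is a range computation plus a lookup in a partition-to-nodes table; a minimal sketch under those assumptions:

```java
import java.util.List;
import java.util.Map;

// Sketch for strategies 2/3: the hash space is pre-divided into Q equal
// contiguous ranges, so a key's partition is its position divided by the
// range size. A table maps each partition to its current owner nodes;
// moving a partition just edits this table (and ships one file).
public class FixedPartitions {
    static final long SPACE = 1L << 62;          // sketch: hash space [0, 2^62)
    final int q;                                 // number of partitions (Q)
    final long rangeSize;
    final Map<Integer, List<String>> owners;     // partition -> preference list

    FixedPartitions(int q, Map<Integer, List<String>> owners) {
        this.q = q;
        this.rangeSize = SPACE / q;
        this.owners = owners;
    }

    List<String> nodesFor(long keyPosition) {    // keyPosition in [0, SPACE)
        int partition = (int) (keyPosition / rangeSize);
        return owners.get(Math.min(partition, q - 1));
    }
}
```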


Page 62: Amazon Dynamo

Ensuring Uniform Load Distribution


Page 63: Amazon Dynamo

Ensuring Uniform Load Distribution

Strategy 3 seems to be the best choice because of faster bootstrapping/recovery:
~ since partition ranges are fixed, they can be stored in separate files
~ a partition can be relocated as a unit by simply transferring the file, avoiding the random accesses needed to locate specific items
~ this simplifies the process of bootstrapping and recovery


Page 64: Amazon Dynamo

Ensuring Uniform Load Distribution

Ease of archival:
~ archival is simpler because the partition files can be archived separately
~ by contrast, in Strategy 1 the tokens are chosen randomly, and archiving the data stored in Dynamo requires retrieving the keys from individual nodes separately, which is usually inefficient and slow

A disadvantage: changing the node membership requires coordination in order to preserve the properties required of the assignment.


Page 65: Amazon Dynamo

Ensuring Uniform Load Distribution

The efficiency of these three strategies was evaluated for a system with S=30 and N=3. The load-balancing efficiency of each strategy was measured for different sizes of the membership information that needs to be maintained at each node, where load-balancing efficiency is defined as the ratio of the average number of requests served by each node to the maximum number of requests served by the hottest node.


Page 66: Amazon Dynamo

Ensuring Uniform Load Distribution


Page 67: Amazon Dynamo

Divergent Versions: When and How many

In our next experiment, the number of versions returned to the shopping cart service was profiled for a period of 24 hours. During this period, 99.94% of requests saw exactly one version; 0.00057% of requests saw 2 versions; 0.00047% of requests saw 3 versions; and 0.00009% of requests saw 4 versions. This shows that divergent versions are created rarely.


Page 68: Amazon Dynamo

Divergent Versions: When and How many


When are divergent versions created?
~ failures: node failure, data center failure, network partition
~ a large number of concurrent writers to a single data item, with the writes handled by multiple nodes

Divergent versions are usually reconciled automatically by vector clocks, and if not, by the client apps.


Page 69: Amazon Dynamo

Client-driven or Server-driven Coordination

Two ways. First way: directly contact a storage node:
~ the Dynamo library links with the client code
~ lower latency, because a potential forwarding step is skipped, but the membership information needs to be updated periodically (default: every 10 seconds)
~ so the client may see stale membership for up to 10 seconds


Page 70: Amazon Dynamo

Client-driven or Server-driven Coordination

Second way: through a load balancer, which in turn picks a node:
~ the client does not have to link with Dynamo code
~ but may see higher latency


Page 71: Amazon Dynamo

Client-driven or Server-driven Coordination


Page 72: Amazon Dynamo

Balancing background vs. foreground task

Foreground tasks: get/put operations.

Background tasks:
~ data handoff (due to hinting or nodes leaving/joining)
~ replica synchronization (due to permanent failure)

Challenge: background activity must not affect foreground tasks.
Solution: monitor foreground operations and admit background tasks only when appropriate.


Page 73: Amazon Dynamo

Final Words

Many Amazon internal services have used Dynamo for the past two years and it has provided significant levels of availability to their applications. In particular, applications have received successful responses (without timing out) for 99.9995% of requests, and no data loss event has occurred to date.

Moreover, the primary advantage of Dynamo is that it provides the necessary knobs, the three parameters (N, R, W), for services to tune their instance based on their needs.


Page 74: Amazon Dynamo

Final Words

Unlike popular commercial data stores, Dynamo exposes data consistency and reconciliation logic issues to the developers.

Dynamo adopts a full membership model where each node is aware of the data hosted by its peers. Exchanging the membership table (routing table) works for hundreds of nodes but may not scale to thousands of nodes.

