Dynamo: Amazon's Highly Available Key-value Store
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, and others
Presented By
Sarang Metkar
Introduction
• Highly available and scalable distributed data store
• Flexible key-value data model
Key-Value Data Model
• Simple Key-Value pairs
• Table - collection of items
• Item - collection of attributes
Introduction [cont'd]
• Fast performance with seamless scalability
• Eventually consistent
• Decentralized system
Motivation
• ‘Always On’ experience for a large customer base
• Reduce impact of failure without compromising performance
• Diverse applications with different storage and data access requirements
• Configurable to achieve stringent SLAs
SLA requirements
• Decentralized, service-oriented architecture
• Multiple dependencies, hence tight constraints
• SLAs measured at the 99.9th percentile
Reference: Dynamo: Amazon's Highly Available Key-value Store, by Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Alex Pilchin, Peter Vosshall, and Werner Vogels
Design Considerations - Consistency
• Availability using optimistic replication - eventual consistency
• Challenges in conflict resolution
• When to resolve?
- Always writable requirement
• Who resolves?
- Application assisted
- Data store's “last write wins” policy
Other Design Considerations
• Incremental Scalability
• Symmetry
• Decentralization
• Heterogeneity
System Interface
• Object storage and access
• get(key)
- Locate object replicas
- Return a single object or a list of objects with conflicting versions
• put(key, context, object)
- Determine location of replica
- Context for conflict resolution
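A minimal sketch of what this two-operation interface could look like to a caller, assuming Java; the DynamoStore, Context, and VersionedObject names are illustrative, not the actual Dynamo API.

```java
import java.util.List;

// Illustrative sketch of Dynamo's two-operation interface (names are hypothetical).
public interface DynamoStore {

    // Opaque version metadata returned with reads and passed back on writes;
    // the store uses it for conflict resolution (e.g. vector clocks).
    interface Context {}

    // A value together with the context it was read under.
    record VersionedObject(byte[] value, Context context) {}

    // get() locates the object's replicas and returns either a single version
    // or a list of conflicting versions for the caller to reconcile.
    List<VersionedObject> get(byte[] key);

    // put() determines where the object's replicas should be placed and writes
    // the new value; the context tells the store which version(s) it supersedes.
    void put(byte[] key, Context context, byte[] value);
}
```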
Partition and Replication
• Consistent hashing for load and data distribution
• Less impact from addition or removal of nodes
• Virtual nodes account for heterogeneity
• Coordinator node stores preference list
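A minimal sketch of consistent hashing with virtual nodes and a clockwise walk of the ring to build a key's preference list, assuming Java and an MD5-based ring; the ConsistentHashRing class is illustrative, not Dynamo's actual implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

// Illustrative consistent-hashing ring with virtual nodes.
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>(); // ring position -> physical host
    private final int virtualNodesPerHost;

    public ConsistentHashRing(int virtualNodesPerHost) {
        this.virtualNodesPerHost = virtualNodesPerHost;
    }

    // Map a string to a 64-bit position on the ring using MD5.
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Each host owns several positions; a more powerful host can be given
    // more virtual nodes, which accounts for heterogeneity.
    public void addNode(String host) {
        for (int i = 0; i < virtualNodesPerHost; i++) ring.put(hash(host + "#" + i), host);
    }

    public void removeNode(String host) {
        for (int i = 0; i < virtualNodesPerHost; i++) ring.remove(hash(host + "#" + i));
    }

    // The first n distinct hosts met walking clockwise from the key's position
    // form the preference list (the first of them acts as coordinator).
    public List<String> preferenceList(String key, int n) {
        List<String> result = new ArrayList<>();
        if (ring.isEmpty()) return result;
        Long start = ring.ceilingKey(hash(key));
        if (start == null) start = ring.firstKey();   // wrap around the ring
        List<String> walk = new ArrayList<>(ring.tailMap(start).values());
        walk.addAll(ring.headMap(start).values());
        for (String host : walk) {
            if (!result.contains(host)) result.add(host);
            if (result.size() == n) break;
        }
        return result;
    }
}
```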
Eventual Consistency
• Asynchronous updates of replicas
• Versioning, based on vector clocks
• Reconciliation
• Syntactic reconciliation
• Semantic reconciliation
• Sloppy quorum-like consistency protocol
• Configurable R, W and N [R + W > N]
Reference: Dynamo: Amazon's Highly Available Key-value Store, by Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Alex Pilchin, Peter Vosshall, and Werner Vogels
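A minimal sketch of vector-clock versioning, which supports the syntactic reconciliation mentioned above; the VectorClock class is illustrative, not Dynamo's actual implementation. When neither clock descends from the other, the versions are concurrent siblings and are handed to the application for semantic reconciliation.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative vector clock: one counter per coordinating node.
public class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    // The coordinator advances its own counter when it handles a write.
    public void increment(String nodeId) {
        counters.merge(nodeId, 1L, Long::sum);
    }

    // True if this clock is greater than or equal to the other on every entry,
    // i.e. this version subsumes the other and the older one can be discarded.
    public boolean descendsFrom(VectorClock other) {
        for (Map.Entry<String, Long> e : other.counters.entrySet()) {
            if (counters.getOrDefault(e.getKey(), 0L) < e.getValue()) return false;
        }
        return true;
    }

    // Entry-wise maximum, used when the application reconciles siblings into one version.
    public VectorClock merge(VectorClock other) {
        VectorClock merged = new VectorClock();
        merged.counters.putAll(counters);
        other.counters.forEach((node, c) -> merged.counters.merge(node, c, Math::max));
        return merged;
    }
}
```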
Handling failures
• High Availability and Durability requirements
• Hinted handoff – temporary failures
• Replica synchronization – permanent failures
• Merkle Trees
- Less data transfer and faster replication
- One for each key range on node
- Recalculation of tree on key range changes
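A minimal sketch of a Merkle tree built over one key range, as used for replica synchronization; the MerkleNode class is illustrative. Two replicas first compare root hashes and descend only into subtrees whose hashes differ, so little data is transferred when the ranges already agree.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

// Illustrative Merkle tree over the (key, value-hash) entries of one key range.
public class MerkleNode {
    final byte[] hash;
    final MerkleNode left, right;   // null for leaves

    private MerkleNode(byte[] hash, MerkleNode left, MerkleNode right) {
        this.hash = hash; this.left = left; this.right = right;
    }

    private static byte[] sha1(byte[]... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            for (byte[] p : parts) md.update(p);
            return md.digest();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Build a tree over the sorted entries of a range; leaves hash individual entries.
    public static MerkleNode build(SortedMap<String, byte[]> range) {
        List<MerkleNode> level = new ArrayList<>();
        for (Map.Entry<String, byte[]> e : range.entrySet()) {
            level.add(new MerkleNode(
                sha1(e.getKey().getBytes(StandardCharsets.UTF_8), e.getValue()), null, null));
        }
        if (level.isEmpty()) return new MerkleNode(sha1(), null, null);
        while (level.size() > 1) {
            List<MerkleNode> parents = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                MerkleNode l = level.get(i);
                MerkleNode r = (i + 1 < level.size()) ? level.get(i + 1) : l;
                parents.add(new MerkleNode(sha1(l.hash, r.hash), l, r));
            }
            level = parents;
        }
        return level.get(0);
    }

    // Equal root hashes mean the key range is already in sync; otherwise the
    // replicas recurse into the children whose hashes differ.
    public static boolean inSync(MerkleNode a, MerkleNode b) {
        return Arrays.equals(a.hash, b.hash);
    }
}
```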
Membership and Failure Detection
• Manual addition or removal of nodes
• Gossip based protocol to reconcile membership changes
• Propagation of partitioning and node-to-token mapping information
• Seeds avoid logical partitioning
• Decentralized failure detection
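A minimal sketch of how gossip could reconcile membership views, assuming each explicit join or leave carries a monotonically increasing version; the MembershipView class is illustrative, not Dynamo's actual protocol.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative gossip reconciliation: each node keeps (member -> change version)
// and, once per gossip round, merges the view of a randomly chosen peer.
public class MembershipView {
    private final Map<String, Long> memberVersions = new HashMap<>();

    // Record an explicitly issued membership change (manual join or leave).
    public void recordChange(String member, long version) {
        memberVersions.merge(member, version, Math::max);
    }

    // Merge a peer's view: the higher version wins per member, so all
    // views eventually converge to the same membership history.
    public void mergeFrom(MembershipView peer) {
        peer.memberVersions.forEach((member, version) ->
            memberVersions.merge(member, version, Math::max));
    }

    public Map<String, Long> snapshot() {
        return Map.copyOf(memberVersions);
    }
}
```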
Implementation
• Local persistence engine
• Application specific
• Pluggable
• Request coordination
• Read/write request execution
• Read repair (sketched below)
• ‘Read-your-writes’ consistency
• Java NIO channel
• Membership and failure detection
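A minimal sketch of the read repair step mentioned above, simplified to a scalar version number in place of vector clocks; the ReadRepair and Replica types are hypothetical.

```java
import java.util.*;

// Illustrative read repair: after gathering replies from R replicas, the
// coordinator returns the newest version to the caller and pushes it back
// to any replica that answered with a stale version.
public class ReadRepair {

    public record Reply(String replicaId, long version, byte[] value) {}

    public interface Replica {
        void put(String key, long version, byte[] value); // asynchronous in practice
    }

    public static byte[] readWithRepair(String key, List<Reply> replies,
                                        Map<String, Replica> replicas) {
        Reply newest = replies.stream()
                .max(Comparator.comparingLong(Reply::version))
                .orElseThrow();
        for (Reply reply : replies) {
            if (reply.version() < newest.version()) {
                replicas.get(reply.replicaId()).put(key, newest.version(), newest.value());
            }
        }
        return newest.value();
    }
}
```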
Key Learnings
• Common (N, R, W) configuration – (3, 2, 2)
• Balancing Performance and Durability
- Buffering of write operations
- At least one durable write to a replica
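A minimal sketch of the quorum configuration check implied by R + W > N, with the common (3, 2, 2) setting; the QuorumConfig class is illustrative.

```java
// Illustrative check that a chosen (N, R, W) configuration makes every read
// quorum overlap every write quorum; (3, 2, 2) satisfies R + W > N.
public final class QuorumConfig {
    public final int n, r, w;

    public QuorumConfig(int n, int r, int w) {
        if (r + w <= n) {
            throw new IllegalArgumentException("R + W must exceed N for quorums to overlap");
        }
        this.n = n; this.r = r; this.w = w;
    }

    // The common configuration reported for Dynamo's applications.
    public static final QuorumConfig COMMON = new QuorumConfig(3, 2, 2);
}
```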
Key Learnings [cont'd]
• Uniform load distribution
- More load imbalance at low load
• Q/S tokens per node, equal-sized partitions
- Faster bootstrapping and recovery
- Ease of archival
Key Learnings [cont'd]
• Divergent Versions
- Failures in the system
- Concurrent writes to a single object by multiple nodes
• Client-driven coordination
- Request coordination at the client
- Pull membership information
- Reduces latency
• Admission Control mechanism for background tasks
Related Work
• Peer to Peer systems
• Unstructured peer-to-peer networks
• Gnutella [1]
• Freenet [2]
• Structured peer-to-peer networks
• Oceanstore [3]
• Beehive [4]
• Distributed File Systems and Databases
• Google File System [5]
• Bayou [6]
Conclusion
• Application specific configuration for availability, durability, performance and consistency
• Evaluation of different techniques to build a highly available system
• Use of eventually consistent storage system in production
• Tuning of various techniques to meet strict production performance requirements
References
• [1] http://www.gnutella.org/
• [2] http://freenetproject.org/
• [3] Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Wells, C., and Zhao, B. 2000. OceanStore: an architecture for global-scale persistent storage.
• [4] Ramasubramanian, V., and Sirer, E. G. Beehive: O(1) lookup performance for power-law query distributions in peer-to-peer overlays.
• [5] Ghemawat, S., Gobioff, H., and Leung, S. 2003. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles.
• [6] Terry, D. B., Theimer, M. M., Petersen, K., Demers, A. J., Spreitzer, M. J., and Hauser, C. H. 1995. Managing update conflicts in Bayou, a weakly connected replicated storage system.
Thank You