Dynamo: Amazon's Highly Available Key-value Store
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, and others
Presented By
Sarang Metkar
Introduction
• Highly available and scalable distributed data store
• Flexible key-value data model
Key-Value Data Model
• Simple Key-Value pairs
• Table - collection of items
• Item - collection of attributes
Introduction [cont'd]
• Fast performance with seamless scalability
• Eventually consistent
• Decentralized system
Motivation
• ‘Always On’ experience for a large customer base
• Reduce impact of failure without compromising performance
• Diverse applications with different storage and data access requirements
• Configurable to achieve stringent SLAs
SLA requirements
• Decentralized, service-oriented architecture
• Multiple dependencies, hence tight constraints
• SLAs measured at the 99.9th percentile
Reference: Dynamo: Amazon's Highly Available Key-value Store, by Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Alex Pilchin, Peter Vosshall, and Werner Vogels
Design Considerations - Consistency
• Availability using optimistic replication - eventual consistency
• Challenges in conflict resolution
• When to resolve?
- Always writable requirement
• Who resolves?
- Application assisted
- Data store's “last write wins” policy
Other Design Considerations
• Incremental Scalability
• Symmetry
• Decentralization
• Heterogeneity
System Interface
• Object storage and access
• get(key)
- Locate object replicas
- Return a single object or a list of objects with conflicting versions
• put(key, context, object)
- Determine location of replica
- Context for conflict resolution
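A minimal sketch of what this two-operation interface could look like to a caller, assuming Java; the DynamoStore, Context, and VersionedObject names are illustrative, not the actual Dynamo API.

```java
import java.util.List;

// Illustrative sketch of Dynamo's two-operation interface (names are hypothetical).
public interface DynamoStore {

    // Opaque version metadata returned with reads and passed back on writes;
    // the store uses it for conflict resolution (e.g. vector clocks).
    interface Context {}

    // A value together with the context it was read under.
    record VersionedObject(byte[] value, Context context) {}

    // get() locates the object's replicas and returns either a single version
    // or a list of conflicting versions for the caller to reconcile.
    List<VersionedObject> get(byte[] key);

    // put() determines where the object's replicas should be placed and writes
    // the new value; the context tells the store which version(s) it supersedes.
    void put(byte[] key, Context context, byte[] value);
}
```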
Partition and Replication
• Consistent hashing for load and data distribution
• Less impact from addition or removal of nodes
• Virtual nodes account for heterogeneity
• Coordinator node stores preference list
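A minimal sketch of consistent hashing with virtual nodes and a clockwise walk of the ring to build a key's preference list, assuming Java and an MD5-based ring; the ConsistentHashRing class is illustrative, not Dynamo's actual implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

// Illustrative consistent-hashing ring with virtual nodes.
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>(); // ring position -> physical host
    private final int virtualNodesPerHost;

    public ConsistentHashRing(int virtualNodesPerHost) {
        this.virtualNodesPerHost = virtualNodesPerHost;
    }

    // Map a string to a 64-bit position on the ring using MD5.
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Each host owns several positions; a more powerful host can be given
    // more virtual nodes, which accounts for heterogeneity.
    public void addNode(String host) {
        for (int i = 0; i < virtualNodesPerHost; i++) ring.put(hash(host + "#" + i), host);
    }

    public void removeNode(String host) {
        for (int i = 0; i < virtualNodesPerHost; i++) ring.remove(hash(host + "#" + i));
    }

    // The first n distinct hosts met walking clockwise from the key's position
    // form the preference list (the first of them acts as coordinator).
    public List<String> preferenceList(String key, int n) {
        List<String> result = new ArrayList<>();
        if (ring.isEmpty()) return result;
        Long start = ring.ceilingKey(hash(key));
        if (start == null) start = ring.firstKey();   // wrap around the ring
        List<String> walk = new ArrayList<>(ring.tailMap(start).values());
        walk.addAll(ring.headMap(start).values());
        for (String host : walk) {
            if (!result.contains(host)) result.add(host);
            if (result.size() == n) break;
        }
        return result;
    }
}
```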
Eventual Consistency
• Asynchronous updates of replicas
• Versioning, based on vector clocks
• Reconciliation
• Syntactic reconciliation
• Semantic reconciliation
• Sloppy quorum-like consistency protocol
• Configurable R, W and N [R + W > N]
Reference: Dynamo: Amazon's Highly Available Key-value Store, by Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Alex Pilchin, Peter Vosshall, and Werner Vogels
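A minimal sketch of vector-clock versioning, which supports the syntactic reconciliation mentioned above; the VectorClock class is illustrative, not Dynamo's actual implementation. When neither clock descends from the other, the versions are concurrent siblings and are handed to the application for semantic reconciliation.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative vector clock: one counter per coordinating node.
public class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    // The coordinator advances its own counter when it handles a write.
    public void increment(String nodeId) {
        counters.merge(nodeId, 1L, Long::sum);
    }

    // True if this clock is greater than or equal to the other on every entry,
    // i.e. this version subsumes the other and the older one can be discarded.
    public boolean descendsFrom(VectorClock other) {
        for (Map.Entry<String, Long> e : other.counters.entrySet()) {
            if (counters.getOrDefault(e.getKey(), 0L) < e.getValue()) return false;
        }
        return true;
    }

    // Entry-wise maximum, used when the application reconciles siblings into one version.
    public VectorClock merge(VectorClock other) {
        VectorClock merged = new VectorClock();
        merged.counters.putAll(counters);
        other.counters.forEach((node, c) -> merged.counters.merge(node, c, Math::max));
        return merged;
    }
}
```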
Handling failures
• High Availability and Durability requirements
• Hinted handoff – temporary failures
• Replica synchronization – permanent failures
• Merkle Trees
- Less data transfer and faster replication
- One for each key range on node
- Recalculation of tree on key range changes
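A minimal sketch of a Merkle tree built over one key range, as used for replica synchronization; the MerkleNode class is illustrative. Two replicas first compare root hashes and descend only into subtrees whose hashes differ, so little data is transferred when the ranges already agree.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

// Illustrative Merkle tree over the (key, value-hash) entries of one key range.
public class MerkleNode {
    final byte[] hash;
    final MerkleNode left, right;   // null for leaves

    private MerkleNode(byte[] hash, MerkleNode left, MerkleNode right) {
        this.hash = hash; this.left = left; this.right = right;
    }

    private static byte[] sha1(byte[]... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            for (byte[] p : parts) md.update(p);
            return md.digest();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Build a tree over the sorted entries of a range; leaves hash individual entries.
    public static MerkleNode build(SortedMap<String, byte[]> range) {
        List<MerkleNode> level = new ArrayList<>();
        for (Map.Entry<String, byte[]> e : range.entrySet()) {
            level.add(new MerkleNode(
                sha1(e.getKey().getBytes(StandardCharsets.UTF_8), e.getValue()), null, null));
        }
        if (level.isEmpty()) return new MerkleNode(sha1(), null, null);
        while (level.size() > 1) {
            List<MerkleNode> parents = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                MerkleNode l = level.get(i);
                MerkleNode r = (i + 1 < level.size()) ? level.get(i + 1) : l;
                parents.add(new MerkleNode(sha1(l.hash, r.hash), l, r));
            }
            level = parents;
        }
        return level.get(0);
    }

    // Equal root hashes mean the key range is already in sync; otherwise the
    // replicas recurse into the children whose hashes differ.
    public static boolean inSync(MerkleNode a, MerkleNode b) {
        return Arrays.equals(a.hash, b.hash);
    }
}
```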
Membership and Failure Detection
• Manual addition or removal of nodes
• Gossip based protocol to reconcile membership changes
• Propagation of partitioning and node-to-token mapping information
• Seeds avoid logical partitioning
• Decentralized failure detection
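A minimal sketch of how gossip could reconcile membership views, assuming each explicit join or leave carries a monotonically increasing version; the MembershipView class is illustrative, not Dynamo's actual protocol.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative gossip reconciliation: each node keeps (member -> change version)
// and, once per gossip round, merges the view of a randomly chosen peer.
public class MembershipView {
    private final Map<String, Long> memberVersions = new HashMap<>();

    // Record an explicitly issued membership change (manual join or leave).
    public void recordChange(String member, long version) {
        memberVersions.merge(member, version, Math::max);
    }

    // Merge a peer's view: the higher version wins per member, so all
    // views eventually converge to the same membership history.
    public void mergeFrom(MembershipView peer) {
        peer.memberVersions.forEach((member, version) ->
            memberVersions.merge(member, version, Math::max));
    }

    public Map<String, Long> snapshot() {
        return Map.copyOf(memberVersions);
    }
}
```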
Implementation
• Local persistence engine
• Application specific
• Pluggable
• Request coordination
• Read/write request execution
• Read repair (sketched below)
• ‘Read-your-writes’ consistency
• Java NIO channel
• Membership and failure detection
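A minimal sketch of the read repair step mentioned above, simplified to a scalar version number in place of vector clocks; the ReadRepair and Replica types are hypothetical.

```java
import java.util.*;

// Illustrative read repair: after gathering replies from R replicas, the
// coordinator returns the newest version to the caller and pushes it back
// to any replica that answered with a stale version.
public class ReadRepair {

    public record Reply(String replicaId, long version, byte[] value) {}

    public interface Replica {
        void put(String key, long version, byte[] value); // asynchronous in practice
    }

    public static byte[] readWithRepair(String key, List<Reply> replies,
                                        Map<String, Replica> replicas) {
        Reply newest = replies.stream()
                .max(Comparator.comparingLong(Reply::version))
                .orElseThrow();
        for (Reply reply : replies) {
            if (reply.version() < newest.version()) {
                replicas.get(reply.replicaId()).put(key, newest.version(), newest.value());
            }
        }
        return newest.value();
    }
}
```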
Key Learnings
• Common (N, R, W) configuration – (3, 2, 2)
• Balancing Performance and Durability
- Buffering of write operations
- At least one durable write to a replica
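A minimal sketch of the quorum configuration check implied by R + W > N, with the common (3, 2, 2) setting; the QuorumConfig class is illustrative.

```java
// Illustrative check that a chosen (N, R, W) configuration makes every read
// quorum overlap every write quorum; (3, 2, 2) satisfies R + W > N.
public final class QuorumConfig {
    public final int n, r, w;

    public QuorumConfig(int n, int r, int w) {
        if (r + w <= n) {
            throw new IllegalArgumentException("R + W must exceed N for quorums to overlap");
        }
        this.n = n; this.r = r; this.w = w;
    }

    // The common configuration reported for Dynamo's applications.
    public static final QuorumConfig COMMON = new QuorumConfig(3, 2, 2);
}
```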
Key Learnings [cont'd]
• Uniform load distribution
- More load imbalance at low load
• Q/S tokens per node, equal-sized partitions
- Faster bootstrapping and recovery
- Ease of archival
Key Learnings [cont'd]
• Divergent Versions
- Failures in the system
- Concurrent writes to a single object by multiple nodes
• Client-driven coordination
- Request coordination at the client
- Pull membership information
- Reduces latency
• Admission Control mechanism for background tasks
Related Work
• Peer to Peer systems
• Unstructured peer-to-peer networks
• Gnutella [1]
• Freenet [2]
• Structured peer-to-peer networks
• Oceanstore [3]
• Beehive [4]
• Distributed File Systems and Databases
• Google File System [5]
• Bayou [6]
Conclusion
• Application specific configuration for availability, durability, performance and consistency
• Evaluation of different techniques to build a highly available system
• Use of eventually consistent storage system in production
• Tuning of various techniques to meet strict production performance requirements
References
• [1] http://www.gnutella.org/
• [2] http://freenetproject.org/
• [3] Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Wells, C., and Zhao, B. 2000. OceanStore: an architecture for global-scale persistent storage.
• [4] Ramasubramanian, V., and Sirer, E. G. Beehive: O(1) lookup performance for power-law query distributions in peer-to-peer overlays.
• [5] Ghemawat, S., Gobioff, H., and Leung, S. 2003. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles.
• [6] Terry, D. B., Theimer, M. M., Petersen, K., Demers, A. J., Spreitzer, M. J., and Hauser, C. H. 1995. Managing update conflicts in Bayou, a weakly connected replicated storage system.
Thank You