Schlosser Dynamo
-
8/12/2019 Schlosser Dynamo
1/21
Dynamo: Amazon's Highly Available Key-value Store
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati,
Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall,
and Werner Vogels
Presented by Steve Schlosser
Big Data Reading Group
October 1, 2007
-
What Dynamo is
Dynamo is a highly available distributed key-value storage system
put(), get() interface
Sacrifices consistency for availability
Provides storage for some of Amazon's key products (e.g., shopping carts, best seller lists, etc.)
Uses synthesis of well known techniques to achieve scalability and availability
Consistent hashing, object versioning, conflict resolution, etc.
-
Scale
Amazon is busy during the holidays
Shopping cart: tens of millions of requests for 3 million checkouts in a single day
Session state system: 100,000s of concurrently active sessions
Failure is common
Small but significant number of server and network failures at all times
"Customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados."
-
Flexibility
Minimal need for manual administration
Nodes can be added or removed without manual partitioning or redistribution
Apps can control availability, consistency, cost-effectiveness, performance
Can developers know this up front?
Can it be changed over time?
-
Assumptions & requirements
Simple query model
values are small (
-
Service level agreements
SLAs are used widely at Amazon
Sub-services must meet strict SLAs
e.g., 300 ms response time for 99.9% of requests at peak load of 500 requests/s
Average-case SLAs are not good enough
Mentioned a cost-benefit analysis that said 99.9% is the right number
Rendering a single page can make requests to 150 services
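A minimal sketch of why an average-case SLA hides the tail. The latency samples are made up for illustration, and the fan-out estimate assumes the 150 service calls are independent, which the slides do not claim.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in ms: 995 fast requests and 5 very slow ones.
latencies_ms = [20] * 995 + [2000] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
p999_ms = percentile(latencies_ms, 99.9)

print(f"mean  = {mean_ms:.1f} ms")
print(f"p99.9 = {p999_ms} ms")
print("meets 300 ms at 99.9%:", p999_ms <= 300)

# If a page calls 150 services and each call independently meets its SLA
# 99.9% of the time (an assumption), the chance that at least one call
# misses is 1 - 0.999**150.
p_any_slow = 1 - 0.999 ** 150
print(f"P(at least one slow call out of 150) = {p_any_slow:.3f}")
```

The mean looks comfortable while the 99.9th percentile misses the 300 ms target, and with a fan-out of 150 roughly one page in seven would touch at least one slow call under the independence assumption.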
-
Consistency
Eventual consistency
Always writable
Can always write to shopping cart
Pushes conflict resolution to reads
Application-driven conflict resolution
e.g., merge conflicting shopping carts
Or Dynamo enforces last-writer-wins
How often does this work?
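A rough sketch of the two reconciliation options, assuming a cart is a dict of item to quantity. The function names, the max-quantity merge rule, and the timestamps are illustrative assumptions, not Dynamo's actual code.

```python
def merge_carts(versions):
    """Application-driven merge: keep every item seen in any version (max quantity wins)."""
    merged = {}
    for cart in versions:
        for item, qty in cart.items():
            merged[item] = max(merged.get(item, 0), qty)
    return merged

def last_writer_wins(stamped_versions):
    """Store-driven resolution: keep only the version with the latest timestamp."""
    return max(stamped_versions, key=lambda pair: pair[0])[1]

# Two branches that diverged from the cart {"book": 1, "lamp": 1}.
branch_a = {"book": 1, "lamp": 1, "pen": 2}   # added pens
branch_b = {"book": 1}                        # removed the lamp

merged = merge_carts([branch_a, branch_b])
lww = last_writer_wins([(10, branch_a), (11, branch_b)])

print("merged:", merged)   # the removed lamp comes back, but no add is lost
print("lww:   ", lww)      # the added pens are silently dropped
```

The merge never loses an "add to cart" but lets a deleted item resurface; last-writer-wins is simpler but can discard a concurrent update.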
-
Other stuff
Incremental scalability
Minimal management overhead
Symmetry
No master/slave nodes
Decentralized
Centralized control leads to too many failures
Heterogeneity
Exploit capabilities of different nodes
-
Interface
get(key) returns object replica(s) for key, plus a context object
context encodes metadata, opaque to caller
put(key, context, object) stores object
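To make the shape of the interface concrete, here is a toy single-process stand-in. `ToyStore` and its use of version ids as the context are assumptions for illustration only; the real context carries versioning metadata and the store is distributed.

```python
import itertools

class ToyStore:
    """Toy model of get()/put() with an opaque context (hypothetical, not Dynamo's code)."""

    def __init__(self):
        self._versions = {}              # key -> {version_id: object}
        self._ids = itertools.count(1)

    def get(self, key):
        live = self._versions.get(key, {})
        context = frozenset(live)        # opaque to the caller
        return list(live.values()), context

    def put(self, key, context, obj):
        live = self._versions.setdefault(key, {})
        for seen in context or ():
            live.pop(seen, None)         # versions the writer had read are superseded
        live[next(self._ids)] = obj

store = ToyStore()
store.put("cart", None, ["book"])
_, ctx = store.get("cart")

store.put("cart", ctx, ["book", "pen"])    # write based on what was read
store.put("cart", ctx, ["book", "lamp"])   # second writer reusing the same, now stale, context

siblings, ctx = store.get("cart")          # both versions come back
store.put("cart", ctx, ["book", "pen", "lamp"])   # caller reconciles and writes back
final, _ = store.get("cart")
print(siblings)
print(final)
```

The caller never inspects the context; it only hands back what get() returned, which is enough for the store to tell a superseding write from a concurrent one.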
-
Variant of consistent hashing
[Figure: hash ring with nodes A, B, C, D, E, F, G and key K]
Each node is assigned to multiple points in the ring
(e.g., B, C, D store key range (A, B))
# of points can be assigned based on node's capacity
If node becomes unavailable, load is distributed to other nodes
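The ring on this slide can be sketched as follows. The `Ring` class, the token counts, and the `node#i` token labels are assumptions for illustration; MD5 is used here simply as a stable hash.

```python
import bisect
import hashlib

def ring_position(label: str) -> int:
    """Map a string to a position on a 128-bit ring."""
    return int.from_bytes(hashlib.md5(label.encode()).digest(), "big")

class Ring:
    def __init__(self):
        self._points = []                # sorted list of (position, node)

    def add_node(self, node, tokens):
        # More tokens (virtual nodes) means a larger share of the key space.
        for i in range(tokens):
            bisect.insort(self._points, (ring_position(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._points = [p for p in self._points if p[1] != node]

    def owner(self, key):
        # First token clockwise from the key's position, wrapping around.
        positions = [pos for pos, _ in self._points]
        idx = bisect.bisect_right(positions, ring_position(key)) % len(self._points)
        return self._points[idx][1]

ring = Ring()
ring.add_node("A", tokens=8)
ring.add_node("B", tokens=8)
ring.add_node("C", tokens=16)            # assumed to have twice the capacity

keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}
ring.remove_node("B")                    # B becomes unavailable
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys changed owner")
```

Only keys that lived on B change owner, and because B held several scattered points its load is spread across the remaining nodes rather than dumped on a single neighbor.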
-
Replication
[Figure: hash ring with nodes A, B, C, D, E, F, G; key K; coordinator for key K]
D stores (A, B], (B, C], (C, D]
B maintains a preference list for each data item specifying nodes storing that item
Preference list skips virtual nodes in favor of physical nodes
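One way to picture the preference list: walk clockwise from the key and collect the first N distinct physical nodes. The ring layout and positions below are invented for illustration.

```python
import bisect

# Hypothetical ring: (position, physical node). A and B each own two tokens.
RING = [(10, "A"), (20, "B"), (30, "B"), (40, "C"), (50, "A"), (60, "D")]

def preference_list(ring, key_pos, n):
    """First n distinct physical nodes clockwise from key_pos; the first one coordinates."""
    positions = [pos for pos, _ in ring]
    start = bisect.bisect_right(positions, key_pos)
    chosen = []
    for step in range(len(ring)):
        _, node = ring[(start + step) % len(ring)]
        if node not in chosen:           # skip further tokens of a node already chosen
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen

print(preference_list(RING, key_pos=15, n=3))   # B's second token at 30 is skipped
print(preference_list(RING, key_pos=55, n=3))   # wraps around the end of the ring
```

Skipping repeated tokens is what keeps N replicas on N different machines even when one machine owns adjacent points.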
-
Data versioning
put() can return before update is applied to all replicas
Subsequent get()s can return older versions
This is okay for shopping carts
Branched versions are collapsed
Deleted items can resurface
A vector clock is associated with each object version
Comparing vector clocks can determine whether two versions are parallel branches or causally ordered
Vector clocks passed by the context object in get()/put()
Application must maintain this metadata?
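A small sketch of the comparison, assuming a vector clock is a dict of node name to counter. The helper names and the node names Sx, Sy, Sz are illustrative.

```python
def increment(clock, node):
    """Return a copy of clock with node's counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def descends(a, b):
    """True if a has seen every event that b has."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    if a == b:
        return "equal"
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    return "concurrent"                  # parallel branches: needs reconciliation

def merge(a, b):
    """Component-wise maximum, used when a reconciled version is written back."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

d1 = increment({}, "Sx")                 # first write handled by Sx
d2 = increment(d1, "Sx")                 # second write handled by Sx
d3 = increment(d2, "Sy")                 # one client's update handled by Sy
d4 = increment(d2, "Sz")                 # another client's update handled by Sz
d5 = increment(merge(d3, d4), "Sx")      # reconciled version written via Sx

print(compare(d2, d1))
print(compare(d3, d4))
print(compare(d5, d3), "/", compare(d5, d4))
```

Versions d3 and d4 are concurrent, so a get() would return both; d5 descends from each, so it replaces them.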
-
Vector clock example
-
Quorum-likeness
get() & put() driven by two parameters:
R: the minimum number of replicas to read
W: the minimum number of replicas to write
R + W > N yields a quorum-like system
Latency is dictated by the slowest R (or W) replicas
Sloppy quorum to tolerate failures
Replicas can be stored on healthy nodes downstream in the ring, with metadata specifying that the replica should be sent to the intended recipient later
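Two quick sketches for this slide: a brute-force check that R + W > N forces every read set to overlap every write set, and a toy version of the sloppy-quorum placement. The function names, node names, and the down-node set are assumptions for illustration.

```python
from itertools import combinations

def quorums_always_overlap(n, r, w):
    """Check every possible read set against every possible write set of replicas."""
    replicas = range(n)
    return all(set(rs) & set(ws)
               for rs in combinations(replicas, r)
               for ws in combinations(replicas, w))

def sloppy_write(preference, fallbacks, down, n):
    """Pick targets for n replicas; a fallback standing in for a down node keeps a hint."""
    spare = [f for f in fallbacks if f not in down]
    targets = []
    for node in preference[:n]:
        if node not in down:
            targets.append((node, None))
        elif spare:
            targets.append((spare.pop(0), node))   # (holder, intended recipient)
    return targets

print(quorums_always_overlap(3, 2, 2))   # R + W > N
print(quorums_always_overlap(3, 1, 1))   # R + W <= N

placement = sloppy_write(["A", "B", "C"], ["D", "E"], down={"A"}, n=3)
print(placement)
```

With A down, its replica lands on D together with a hint naming A, so D can hand the data back once A recovers.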
-
Adding and removing nodes
Explicit commands issued via CLI or browser
Gossip-style protocol propagates changes among nodes
New node chooses virtual nodes in the hash space
-
Implementation
Persistent store either Berkeley DB Transactional Data Store, BDB Java Edition, MySQL, or in-memory buffer w/ persistent backend
All in Java!
Common N, R, W setting is (3, 2, 2)
Results are from several hundred nodes configured as (3, 2, 2)
Not clear whether they run in a single data center
-
[Figure: one tick = 12 hours]
-
[Figure: one tick = 1 hour]
-
[Figure: one tick = 30 minutes]
During periods of high load, popular objects dominate
During periods of low load, fewer popular objects are accessed
-
Quantifying divergent versions
In a 24 hour trace
99.94% of requests saw exactly one version
0.00057% received 2 versions
0.00047% received 3 versions
0.00009% received 4 versions
Experience showed that divergence usually came from concurrent writers due to automated client programs (robots), not humans
-
Conclusions
Scalable: Easy to shovel in more capacity at Christmas
Simple: get()/put() maps well to Amazon's workload
Flexible: Apps can set N, R, W to match their needs
Inflexible: Apps have to set N, R, W to match their needs
Apps may have to do their own conflict resolution
They claim it's easy to set these; does this mean that there aren't many interesting points?
Interesting?