Transcript of Dynamo cassandra
• Amazon Dynamo
• Facebook Cassandra (Dynamo 2.0)
– Inbox search
• Apache
• Dynamo-like features
– Symmetric, P2P architecture
– Gossip-based cluster management
– DHT
– Eventual consistency
• Bigtable-like features
– Column family
– SSTable disk storage
• Commit log
• Memtable
• Immutable SSTable files
Data Model (1/2)
• A table is a distributed multi-dimensional map indexed by a key
– Keyspace
– Column
– Super Column
– Column Family Types
Data Model (2/2)
• Paper:
– insert(table; key; rowMutation)
– get(table; key; columnName)
– delete(table; key; columnName)
• Wiki:
– http://wiki.apache.org/cassandra/API
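The three operations listed under "Paper" can be pictured as updates to a nested map. The following is only a toy sketch: the `Table` class, the shape of `row_mutation` (a dict of column family to columns) and the `(family, column)` pair used as `column_name` are assumptions made for illustration, and the `table` argument is implicit in the object.

```python
class Table:
    """Toy in-memory table: row key -> column family -> column -> value."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, row_mutation):
        # row_mutation is assumed to look like {column_family: {column: value}}
        row = self.rows.setdefault(key, {})
        for family, columns in row_mutation.items():
            row.setdefault(family, {}).update(columns)

    def get(self, key, column_name):
        family, column = column_name  # assumed (family, column) pair
        return self.rows.get(key, {}).get(family, {}).get(column)

    def delete(self, key, column_name):
        family, column = column_name
        self.rows.get(key, {}).get(family, {}).pop(column, None)


table = Table()
table.insert("user1", {"profile": {"name": "Ann", "city": "Oslo"}})
print(table.get("user1", ("profile", "name")))
```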
Architecture Layers
• Consistent Hash Table
• Problems:
– the random position assignment of each node on the ring leads to non-uniform data and load distribution
– the basic algorithm is oblivious to the heterogeneity in the performance of nodes
• Two Ways:
– Dynamo
• One node is assigned to multiple positions in the circle
– Cassandra
• Analyze load information on the ring and have lightly loaded nodes move on the ring to alleviate heavily loaded nodes
• Each Cassandra server [node] is assigned a unique Token that determines what keys it is the first replica for.
• Choice
– InitialToken: assigned
– RandomPartitioner: Tokens are integers from 0 to 2**127. Keys are converted to this range by MD5 hashing for comparison with Tokens.
– NetworkTopologyStrategy: calculate the tokens for the nodes in each DC independently. Tokens still need to be unique, so you can add 1 to the tokens in the 2nd DC, add 2 in the 3rd, and so on.
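A minimal sketch of how a key finds its first replica under RandomPartitioner-style hashing. The function names and the `ring` dict (token to node name) are hypothetical, and reducing the MD5 digest with a modulo is a simplification, not necessarily how Cassandra itself maps the digest into the token range.

```python
import bisect
import hashlib

TOKEN_SPACE = 2 ** 127  # size of the token range named on the slide


def token_for(key: str) -> int:
    """Map a key into the token range via MD5 (the modulo is a simplification)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE


def owner_of_token(ring: dict, token: int) -> str:
    """Return the node whose token is the first one >= token, wrapping around."""
    tokens = sorted(ring)
    index = bisect.bisect_left(tokens, token) % len(tokens)
    return ring[tokens[index]]


def first_replica(ring: dict, key: str) -> str:
    return owner_of_token(ring, token_for(key))


ring = {0: "node-a", 2 ** 125: "node-b", 2 ** 126: "node-c"}
print(first_replica(ring, "some-row-key"))
```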
• high availability and durability
• replication_factor: N
• Strategy
– Rack Unaware
– Rack Aware
– Datacenter Aware
– …
• Cassandra system elects a leader amongst its nodes using a system called Zookeeper
• All nodes, on joining the cluster, contact the leader, who tells them which ranges they are replicas for
• The leader makes a concerted effort to maintain the invariant that no node is responsible for more than N-1 ranges in the ring.
• The metadata about the ranges a node is responsible for is cached locally at each node and in a fault-tolerant manner inside Zookeeper
• This way a node that crashes and comes back up knows what ranges it was responsible for.
• Cassandra provides durability guarantees in the presence of node failures and network partitions by relaxing the quorum requirements
Data Versioning
• Vector clocks
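As a rough illustration of how vector clocks order versions, the sketch below represents a clock as a dict from node name to counter. The function names and the node labels `Sx`, `Sy`, `Sz` are illustrative assumptions, not taken from the slides.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    new_clock = dict(clock)
    new_clock[node] = new_clock.get(node, 0) + 1
    return new_clock


def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "after"       # a supersedes b
    if descends(b, a):
        return "before"      # b supersedes a
    return "concurrent"      # causally unrelated, needs reconciliation


v1 = increment({}, "Sx")
v2 = increment(v1, "Sx")
v3 = increment(v2, "Sy")     # written through node Sy
v4 = increment(v2, "Sz")     # written through node Sz at the same time
print(compare(v3, v4))
```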
W + R > N
• put():
– the coordinator generates the vector clock for the new version and writes the new version locally.
– The coordinator then sends the new version (along with the new vector clock) to the N highest-ranked reachable nodes.
– If at least W-1 nodes respond then the write is considered successful.
• get():
– the coordinator requests all existing versions of data for that key from the N highest-ranked reachable nodes in the preference list for that key, and
– waits for R responses before returning the result to the client.
– If the coordinator ends up gathering multiple versions of the data, it returns all the versions it deems to be causally unrelated. The divergent versions are then reconciled and the reconciled version superseding the current versions is written back.
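The condition W + R > N matters because it forces every read set to share at least one replica with every write set. The brute-force check below is a small sketch of that argument; the function name is made up for illustration.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """Check every possible write set of size w against every read set of size r."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


for n, w, r in [(3, 2, 2), (3, 1, 1), (3, 3, 1), (5, 3, 3)]:
    print(n, w, r, quorums_always_overlap(n, w, r))
```

With N = 3, choosing W = 2 and R = 2 guarantees overlap, while W = 1 and R = 1 does not, so a read may miss the latest write.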
Handling Temporary Failures
• Hinted handoff
– If node A is temporarily down or unreachable during a write operation, then a replica that would normally have lived on A will now be sent to node D.
– The replica sent to D will have a hint in its metadata that suggests which node was the intended recipient of the replica (in this case A).
– Nodes that receive hinted replicas will keep them in a separate local database that is scanned periodically. Upon detecting that A has recovered, D will attempt to deliver the replica to A.
– Once the transfer succeeds, D may delete the object from its local store without decreasing the total number of replicas in the system.
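The hinted handoff steps above can be sketched with two in-memory nodes. The `Node` class, `write_replica` and `deliver_hints` are hypothetical names, and the periodic scan is reduced to an explicit function call.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.store = {}   # key -> value, the node's normal replica store
        self.hints = []   # separate list of (intended_node, key, value)


def write_replica(intended, fallback, key, value):
    """Write to the intended node, or park a hinted replica on the fallback."""
    if intended.up:
        intended.store[key] = value
    else:
        fallback.hints.append((intended, key, value))


def deliver_hints(holder):
    """One scan of the hint store: hand replicas back to recovered nodes."""
    still_pending = []
    for intended, key, value in holder.hints:
        if intended.up:
            intended.store[key] = value   # transfer succeeded, hint is dropped
        else:
            still_pending.append((intended, key, value))
    holder.hints = still_pending


a, d = Node("A"), Node("D")
a.up = False
write_replica(a, d, "cart:42", "3 items")   # A is down, D keeps the hint
a.up = True
deliver_hints(d)                            # D notices A is back
print(a.store, d.hints)
```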
Handling permanent failures
• Replica synchronization: anti-entropy
– To detect the inconsistencies between replicas faster and to minimize the amount of transferred data
Cassandra Consistency For Read
Level: Behavior
ANY: Not supported. You probably want ONE instead.
ONE: Will return the record returned by the first replica to respond. A consistency check is always done in a background thread to fix any consistency issues when ConsistencyLevel.ONE is used. This means subsequent calls will have correct data even if the initial read gets an older value. (This is called ReadRepair)
QUORUM: Will query all replicas and return the record with the most recent timestamp once it has at least a majority of replicas (N / 2 + 1) reported. Again, the remaining replicas will be checked in the background.
LOCAL_QUORUM: Returns the record with the most recent timestamp once a majority of replicas within the local datacenter have replied.
EACH_QUORUM: Returns the record with the most recent timestamp once a majority of replicas within each datacenter have replied.
ALL: Will query all replicas and return the record with the most recent timestamp once all replicas have replied. Any unresponsive replicas will fail the operation.
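To make the read levels concrete, here is a simplified sketch in which each replica is a dict from key to a `(timestamp, value)` pair and the first `required` replicas in the list stand in for "the first to respond". The function name and data layout are assumptions; the background consistency check is run inline for brevity.

```python
def read_with_repair(replicas, key, required):
    """Return the newest value among the first `required` replicas, then repair."""
    answered = [r[key] for r in replicas[:required] if key in r]
    result = max(answered, key=lambda pair: pair[0], default=None)

    # Read repair (done in the background in Cassandra): push the newest
    # version found on any replica to every replica that is out of date.
    versions = [r[key] for r in replicas if key in r]
    if versions:
        newest = max(versions, key=lambda pair: pair[0])
        for replica in replicas:
            if replica.get(key, (-1, None))[0] < newest[0]:
                replica[key] = newest

    return result[1] if result else None


replicas = [{"k": (1, "old")}, {"k": (2, "new")}, {"k": (1, "old")}]
first = read_with_repair(replicas, "k", required=1)    # like ONE
second = read_with_repair(replicas, "k", required=1)   # after read repair
print(first, second)
```

The first read at level ONE may see the older value, and the repaired replicas make the next read return the newer one, which is the behavior the table describes.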
Cassandra Consistency For Write
Level: Behavior
ANY: Ensure that the write has been written to at least 1 node, including HintedHandoff recipients.
ONE: Ensure that the write has been written to at least 1 replica's commit log and memory table before responding to the client.
QUORUM: Ensure that the write has been written to N / 2 + 1 replicas before responding to the client.
LOCAL_QUORUM: Ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes, within the local datacenter (requires NetworkTopologyStrategy)
EACH_QUORUM: Ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes in each datacenter (requires NetworkTopologyStrategy)
ALL: Ensure that the write is written to all N replicas before responding to the client. Any unresponsive replicas will fail the operation.
Cassandra Read Repair
• Cassandra repairs data in two ways:
– Read Repair: every time a read is performed, Cassandra compares the versions at each replica (in the background, if a low consistency was requested by the reader to minimize latency), and the newest version is sent to any out-of-date replicas.
– Anti-Entropy: when nodetool repair is run, Cassandra computes a Merkle tree for each range of data on that node, and compares it with the versions on other replicas, to catch any out-of-sync data that hasn't been read recently. This is intended to be run infrequently (e.g., weekly) since computing the Merkle tree is relatively expensive in disk i/o and CPU, since it scans ALL the data on the machine (but it is very network efficient).
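A small sketch of why Merkle trees keep anti-entropy network efficient: two replicas exchange hashes top-down and only descend into subtrees whose hashes differ. The helper names are hypothetical, each leaf stands for one key range serialized as bytes, and the number of leaves is assumed to be a power of two to keep the code short.

```python
import hashlib


def _hash(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(leaves):
    """Return the tree as a list of levels, leaf hashes first, root last."""
    level = [_hash(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [_hash(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(a, b, depth=None, index=0):
    """Indices of leaves that differ, visiting only mismatched subtrees."""
    if depth is None:
        depth = len(a) - 1          # start at the root
    if a[depth][index] == b[depth][index]:
        return []                   # identical subtree, nothing to transfer
    if depth == 0:
        return [index]
    return (differing_leaves(a, b, depth - 1, 2 * index)
            + differing_leaves(a, b, depth - 1, 2 * index + 1))


replica_1 = build_tree([b"range0", b"range1", b"range2", b"range3"])
replica_2 = build_tree([b"range0", b"range1", b"stale!", b"range3"])
print(differing_leaves(replica_1, replica_2))
```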
• New node
• Position
– specify an InitialToken
– pick a Token that will give it half the keys from the node with the most disk space used
• Note:
– You should wait long enough for all the nodes in your cluster to become aware of the bootstrapping node via gossip before starting another bootstrap
– Relating to point 1, one can only bootstrap N nodes at a time with automatic token picking, where N is the size of the existing cluster.
– As a safety measure, Cassandra does not automatically remove data from nodes that "lose" part of their Token Range to a newly added node.
– When bootstrapping a new node, existing nodes have to divide the key space before beginning replication.
– During bootstrap, a node will drop the Thrift port and will not be accessible from nodetool
– Bootstrap can take many hours when a lot of data is involved
Moving or Removing nodes
• Remove nodes
– Live node: nodetool decommission
• the data will stream from the decommissioned node
– Dead node: nodetool removetoken
• the data will stream from the remaining replicas
• Move nodes
– nodetool move: decommission + bootstrap
• LB
– If you add nodes to your cluster, your ring will be unbalanced, and the only way to get perfect balance is to compute new tokens for every node and assign them to each node manually by using the nodetool move command.
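One common way to compute the new tokens is to space them evenly over the RandomPartitioner range of 0 to 2**127. The helper below is a sketch under that assumption; `balanced_tokens` is an illustrative name, and each resulting value would still have to be assigned by hand with nodetool move.

```python
def balanced_tokens(node_count: int) -> list:
    """Evenly spaced tokens over the 0 .. 2**127 range, one per node."""
    return [i * (2 ** 127) // node_count for i in range(node_count)]


for position, token in enumerate(balanced_tokens(4)):
    print(position, token)
```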
• Scuttlebutt
– Based on Gossip
– efficient CPU utilization
– efficient utilization of the gossip channel
• anti-entropy Gossip
– Paper: Efficient Reconciliation and Flow Control for Anti-Entropy Protocols
Failure Detection
• The φ Accrual Failure Detector
– Idea: the failure detection module doesn't emit a Boolean value stating a node is up or down. Instead the failure detection module emits a value which represents a suspicion level for each of the monitored nodes
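A minimal sketch of the idea, assuming heartbeat inter-arrival times follow an exponential distribution, so that φ = -log10(P(no heartbeat yet)) reduces to elapsed / (mean interval × ln 10). The class name, window size and timestamps in seconds are illustrative choices, not Cassandra's implementation.

```python
import math
from collections import deque


class PhiAccrualDetector:
    def __init__(self, window=1000):
        self.intervals = deque(maxlen=window)  # recent heartbeat gaps
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        """Record the arrival time of a heartbeat from the monitored node."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Suspicion level: grows the longer the node stays silent."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        return elapsed / (mean * math.log(10))


detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    detector.heartbeat(t)
print(detector.phi(4.0), detector.phi(10.0))
```

The caller picks a threshold on φ rather than receiving an up/down verdict, which lets different parts of the system react at different suspicion levels.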
Local Persistence (1/4)
• Write Operation:
– 1. write into a commit log
– 2. an update into an in-memory data structure
– 3. When the in-memory data structure crosses a certain threshold, calculated based on data size and number of objects, it dumps itself to disk
• Read Operation:
– 1. query the in-memory data structure
– 2. look into the files on disk in the order of newest to oldest
– 3. combine
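The write and read steps above can be sketched with in-memory stand-ins. `MiniStore` is a hypothetical class: the commit log is a plain list, "disk" files are immutable dict snapshots, the flush threshold counts keys only, and the combine step is reduced to returning the newest value found.

```python
class MiniStore:
    def __init__(self, flush_threshold=2):
        self.commit_log = []     # append-only stand-in for the on-disk log
        self.memtable = {}       # in-memory data structure
        self.sstables = []       # flushed snapshots, oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. commit log first
        self.memtable[key] = value             # 2. then the in-memory update
        if len(self.memtable) >= self.flush_threshold:
            # 3. threshold crossed: dump the memtable, sorted by key
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        if key in self.memtable:               # 1. in-memory structure
            return self.memtable[key]
        for table in reversed(self.sstables):  # 2. files, newest to oldest
            if key in table:
                return table[key]
        return None


store = MiniStore(flush_threshold=2)
store.write("a", 1)
store.write("b", 2)   # triggers a flush
store.write("a", 3)   # newer value stays in the memtable
print(store.read("a"), store.read("b"), len(store.sstables))
```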
Local Persistence (2/4)
• Commit log
– all writes into the commit log are sequential
– Fixed size
– Create/delete
– Durability and recoverability
Local Persistence (3/4)
• Memtable
– Per column family
– a write-back cache of data rows that can be looked up by key
– sorted by key
Local Persistence (4/4)
• SSTable
– Flushing
• Once flushed, SSTable files are immutable; no further writes may be done.
– Compaction
• merging multiple old SSTable files into a single new one
• Since the input SSTables are all sorted by key, merging can be done efficiently, still requiring no random i/o.
• Once compaction is finished, the old SSTable files may be deleted
• Discard tombstones
– index
• All writes are sequential to disk and also generate an index for efficient lookup based on row key. These indices are also persisted along with the data file
• In order to prevent lookups into files that do not contain the key, a bloom filter, summarizing the keys in the file, is also stored in each data file and also kept in memory.
• In order to prevent scanning of every column on disk we maintain column indices which allow us to jump to the right chunk on disk for column retrieval.
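Two of the SSTable mechanisms above, compaction and the per-file bloom filter, can be sketched as follows. Everything here is illustrative: SSTables are lists of `(key, value)` pairs sorted by key, `TOMBSTONE` is a made-up deletion marker, tombstones are dropped immediately (a real system typically waits for a grace period), and the bloom filter uses SHA-256 slices with arbitrarily chosen sizes.

```python
import hashlib
import heapq

TOMBSTONE = object()  # hypothetical marker for a deleted row


def compact(sstables):
    """Merge key-sorted SSTables (oldest first) into one; newest value wins."""
    # Tag entries with the negated table age so newer versions sort first.
    streams = [
        [(key, -age, value) for key, value in table]
        for age, table in enumerate(sstables)
    ]
    merged, last_key = [], object()
    # heapq.merge reads each sorted input sequentially, so no random access.
    for key, _, value in heapq.merge(*streams, key=lambda e: (e[0], e[1])):
        if key == last_key:
            continue                 # an older version of a key already seen
        last_key = key
        if value is not TOMBSTONE:   # discard tombstones
            merged.append((key, value))
    return merged


class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for position in self._positions(key):
            self.field |= 1 << position

    def might_contain(self, key):
        """False means definitely absent; True means the file must be checked."""
        return all((self.field >> p) & 1 for p in self._positions(key))


old = [("a", 1), ("b", 2), ("c", 3)]
new = [("b", 20), ("c", TOMBSTONE), ("d", 4)]
merged = compact([old, new])
summary = BloomFilter()
for key, _ in merged:
    summary.add(key)
print(merged, summary.might_contain("a"))
```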
Facebook inbox search
• Column Family: terms_cf
– Super Column: term1
• mail1 mail2 mail4 …
– Super Column: term2 …
• mail2 mail3 mail4 …
• Column Family: interaction_cf
– Super Column: user1
• mail1 mail3 mail4 …
– Super Column: user2 …
• mail2 mail3 mail8 …
• Column values: content4 … content2 content3 content8
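The two column families can be pictured as nested maps. In the sketch below the row key is assumed to be the mailbox owner's user id, as described in the cited paper; the helper functions, the whitespace tokenization and storing the full text as the column value are simplifications made for illustration.

```python
from collections import defaultdict


def new_column_family():
    # row key (user id) -> super column -> column (message id) -> value
    return defaultdict(lambda: defaultdict(dict))


terms_cf = new_column_family()        # super column = term
interaction_cf = new_column_family()  # super column = other user


def index_message(owner, message_id, other_user, text):
    for term in set(text.lower().split()):
        terms_cf[owner][term][message_id] = text
    interaction_cf[owner][other_user][message_id] = text


def search_term(owner, term):
    return sorted(terms_cf[owner][term.lower()])


def search_person(owner, other_user):
    return sorted(interaction_cf[owner][other_user])


index_message("alice", "mail1", "bob", "lunch tomorrow")
index_message("alice", "mail2", "carol", "Lunch moved")
print(search_term("alice", "lunch"), search_person("alice", "bob"))
```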
• http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
• http://wiki.apache.org/cassandra/FrontPage