Cassandra under the hood
2017
Who I am
Java Software Engineer @ Lohika
More than 7 years of experience
Andriy Rymar
What we won’t
• Learn how to use Cassandra
• Learn about performance tuning
• Learn how to manage cluster
• Learn how to interact with Cassandra
What we will
We will learn what Cassandra is
Content
• General overview
• Data model
• Architecture
• Read & Write operations
Preface
• RDBMS - is not bad
• RDBMS - has been successful in the last 40 years
RDBMS: Issues
• Slow queries due to complex joins, and reindexing data takes a long time
• Expensive vertical scaling and problems with horizontal scaling
• Replicating the database hurts the availability of the system
CAP
[Figure: CAP triangle (consistency, availability, partition tolerance) with RDBMS and NoSQL systems placed on it]
CA, CP, AP
• Consistency & Availability
• Consistency & Partition-tolerance
• Availability & Partition-tolerance
Eventual consistency
Eventually consistent system without any failures
Eventually consistent system with failures
[Figures: replicas holding values V0 and V1 over time]
Solution
• Google BigTable: 2004
• Cassandra: 2008 (2010, 2013)
• Amazon DynamoDB: 2012
Cassandra: General Overview
Cassandra cluster
[Figure: ring of nodes N1, N2, N3 with tokens A, G, R]
Tokens & Seed node & Ring representation
A - F
G - Q
R - Z
Tokens determine the position of a node in the ring and the portion of data it is responsible for
Cassandra cluster
• N1: A-F
• N2: G-Q
• N3: R-Z
pk: «Taras», message: «Hello»
Replication Factor (RF) = 2
[Figure: each node additionally holds a replica of a second range (G-Q, R-Z, A-F)]
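The ring lookup from the slides above can be sketched in a few lines. This is a toy model, not Cassandra code: the class `LetterRing` and its methods are hypothetical, tokens are single letters as on the slides (real tokens are hash values), and the extra replicas are assumed to go to the next nodes clockwise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy ring from the slides: each node's token is the first letter of its range.
final class LetterRing {
    private final TreeMap<Character, String> ring = new TreeMap<>();

    void addNode(char token, String node) {
        ring.put(token, node);
    }

    // Returns the owner of the key followed by the next nodes clockwise,
    // rf nodes in total (capped at the number of nodes in the ring).
    List<String> replicasFor(String partitionKey, int rf) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty() || partitionKey.isEmpty()) {
            return replicas;
        }
        char letter = Character.toUpperCase(partitionKey.charAt(0));
        // Owner: the node with the greatest token <= letter; wrap to the last node otherwise.
        Map.Entry<Character, String> owner = ring.floorEntry(letter);
        Character token = (owner != null) ? owner.getKey() : ring.lastKey();
        int wanted = Math.min(rf, ring.size());
        while (replicas.size() < wanted) {
            replicas.add(ring.get(token));
            token = ring.higherKey(token);
            if (token == null) {
                token = ring.firstKey(); // wrap around the ring
            }
        }
        return replicas;
    }

    public static void main(String[] args) {
        LetterRing ring = new LetterRing();
        ring.addNode('A', "N1");
        ring.addNode('G', "N2");
        ring.addNode('R', "N3");
        // «Taras» starts with T, which falls into R-Z.
        System.out.println(ring.replicasFor("Taras", 2));
    }
}
```

With RF = 2 the row with pk «Taras» lands on the owner of R-Z and on one neighbour, so it survives the loss of a single node.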
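Tokens: Issues
• Initial token values must be managed manually for all nodes
• Big overhead when restoring node data
import java.math.BigInteger;
for (int i = 0; i < CLUSTER_SIZE; i++) { // token_i = (2^64 / CLUSTER_SIZE) * i - 2^63
    System.out.println(BigInteger.ONE.shiftLeft(64).divide(BigInteger.valueOf(CLUSTER_SIZE)).multiply(BigInteger.valueOf(i)).subtract(BigInteger.ONE.shiftLeft(63)));
}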
[Figure: ring N1, N2, N3 with Replication Factor (RF) = 2; N2 is replaced by a new N2]
Virtual Nodes
[Figure: ring of 12 virtual nodes (1 to 12) spread across Server1, Server2, Server3 and Server4]
Virtual Nodes: Data restoring
[Figure: servers S1, S2, S3, S4 with vnode = 3 and RF = 2]
V-nodes: Summary
• Rebalancing a cluster is no longer necessary when adding or removing nodes
• More powerful machines can have more v-nodes, which makes it possible to build a heterogeneous Cassandra ring
Cassandra: Data model
Introduction to the data model
KEYSPACE
Table (column family)
[Figure: table rows, each with a partition key and columns column1, column2, column3 holding values]
Column family
• RDBMS
username | email | title | age
Taras | [email protected] | Staff Engineer |
Andriy | [email protected] | | 27
…
• Column Family
user
key: Taras, value: email: [email protected], title: Staff Engineer
key: Andriy, value: email: [email protected], age: 27
…
Column family
“user” : {“Taras” : {
“email” : “[email protected]”,“title” : “Staff
Engineer”},“Andriy” : {
“email” : “[email protected]”,“age” : “27”
}}
Other differences
• No relations (No Joins)
• Tuples (key-value pairs) are naturally sorted
• You may want to denormalize the data model
• No transactions
Type of keys
• Primary key
• Composite key
• Partition key
• Clustering key
• Composite partition key
Example 1
CREATE TABLE album ( id uuid, name text, PRIMARY KEY (id));
Primary key and also the partition key
id - partition & primary key at the same time
Composite key
Example 2
CREATE TABLE author_book ( author text, book text, population int, PRIMARY KEY (author, book))
author - partition key; (author, book) - primary key
Example 3
Key with composite partition & clustering keys
CREATE TABLE teacher_lesson ( teacher text, lesson text, topic text, duration int, PRIMARY KEY ((teacher, lesson), topic, duration))
(teacher, lesson) - composite partition key; topic, duration - clustering keys
Row vs Partition
Rows
[Figure: grid of rows numbered 1 to 12]
Partitions
• Node 1: 1234:user, 1234:address, 1234:details
• Node 2: 5678:user, 5678:address, 5678:details
Coffee break
• General overview
• Data model
• Architecture
• Read & Write operations
Cassandra: Architecture
Cassandra components
API Tools
Storage layer
Partitioner Replicator
Failure detector Compaction Manager
Messaging layer
Messaging service
In a cluster of 5 nodes, each node has 8 open socket connections: 2 with every other node
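The arithmetic behind the slide is 2 × (cluster size − 1). A minimal sketch, with a hypothetical helper name and the "2 sockets per peer" figure taken from the slide rather than from any Cassandra API:

```java
final class MessagingSockets {
    // From the slide: each node keeps 2 open sockets with every other node.
    static final int SOCKETS_PER_PEER = 2;

    static int socketsPerNode(int clusterSize) {
        if (clusterSize < 1) {
            throw new IllegalArgumentException("cluster needs at least one node");
        }
        return SOCKETS_PER_PEER * (clusterSize - 1);
    }

    public static void main(String[] args) {
        System.out.println(socketsPerNode(5)); // prints 8
    }
}
```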
Gossip
Gossip: How does Cassandra initiate sessions?
• One session with a random live node
• One session with a random unreachable node
• If the node in point 1 is not a seed node, then create a session with a random seed node
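The three rules above can be written down as a small selection routine. This is a sketch of the simplified rules as the slide states them, not of Cassandra's gossiper; `GossipRound` and `chooseTargets` are hypothetical names, and treating "no live node available" the same as "live node is not a seed" is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;

final class GossipRound {
    // Picks the peers to gossip with in one round, following the three rules from the slide.
    static List<String> chooseTargets(List<String> live, List<String> unreachable,
                                      Set<String> seeds, Random random) {
        List<String> targets = new ArrayList<>();

        // Rule 1: one session with a random live node.
        String livePeer = null;
        if (!live.isEmpty()) {
            livePeer = live.get(random.nextInt(live.size()));
            targets.add(livePeer);
        }

        // Rule 2: one session with a random unreachable node.
        if (!unreachable.isEmpty()) {
            targets.add(unreachable.get(random.nextInt(unreachable.size())));
        }

        // Rule 3: if the node from rule 1 is not a seed, also talk to a random seed.
        boolean talkedToSeed = livePeer != null && seeds.contains(livePeer);
        if (!talkedToSeed && !seeds.isEmpty()) {
            List<String> seedList = new ArrayList<>(seeds);
            targets.add(seedList.get(random.nextInt(seedList.size())));
        }
        return targets;
    }
}
```

Rule 3 is what keeps cluster state converging: seeds act as well-known meeting points, so information that reaches a seed spreads to every node that gossips with it.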
Gossip: Session
• 1: N1 → N2: GossipSyncMessage
• 2: N2 → N1: GossipAckMessage
• 3: N1 → N2: GossipAck2Message
Failure detection: ϕ accrual failure detector
• Doesn’t use TRUE / FALSE
• Provides a continuous value
• This value is called «ϕ»
[Figure: sessions 1 to 5 on a time axis, with 1s and 2s intervals between them]
Failure detection
Proposed by Naohiro Hayashibara, Xavier Défago, Rami Yared and Takuya Katayama in 2004
https://goo.gl/xS0kB0
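To make the «ϕ» value concrete, here is a minimal sketch of how such a suspicion level can be computed from heartbeat arrival times. It assumes exponentially distributed intervals between heartbeats, which is a simplification (the 2004 paper models them with a normal distribution); the class `PhiAccrualDetector` and its window size are hypothetical and not taken from Cassandra.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class PhiAccrualDetector {
    private final Deque<Double> intervals = new ArrayDeque<>();
    private final int windowSize;
    private double lastHeartbeatMillis = Double.NaN;

    PhiAccrualDetector(int windowSize) {
        this.windowSize = windowSize;
    }

    // Records a heartbeat and remembers the interval since the previous one.
    void heartbeat(double nowMillis) {
        if (!Double.isNaN(lastHeartbeatMillis)) {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > windowSize) {
                intervals.removeFirst(); // keep a sliding window of recent intervals
            }
        }
        lastHeartbeatMillis = nowMillis;
    }

    // Suspicion level: 0 right after a heartbeat, growing the longer the node stays silent.
    double phi(double nowMillis) {
        if (intervals.isEmpty()) {
            return 0.0; // not enough history yet
        }
        double mean = intervals.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double elapsed = nowMillis - lastHeartbeatMillis;
        // P(next heartbeat arrives later than `elapsed`) = exp(-elapsed / mean);
        // phi = -log10 of that probability.
        return elapsed / (mean * Math.log(10.0));
    }

    public static void main(String[] args) {
        PhiAccrualDetector detector = new PhiAccrualDetector(100);
        detector.heartbeat(0);
        detector.heartbeat(1000);
        detector.heartbeat(2000);
        System.out.println(detector.phi(2500));
        System.out.println(detector.phi(10000));
    }
}
```

The caller compares ϕ with a threshold of its choice instead of receiving a TRUE / FALSE verdict: under this model ϕ = 1 means roughly a 10% chance that the node is still alive and just late, ϕ = 2 means roughly 1%.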
Partitioner
[Figure: all terabytes of data split across nodes N1 to N8]
Partitioner
• Murmur3Partitioner
• RandomPartitioner
• ByteOrderedPartitioner
Replicator
• Replication factor = 3
[Figure: write data request to a cluster of N1, N2, N3, N4; replicas on N1, N2, N3]
• Consistency Level = 2
[Figure: N1, N2]
Consistency level
• ZERO (write only): push and forget
• ANY (write only): success even if the write is only stored as a hint
• ONE: first replica returned successfully
• QUORUM: N/2 + 1 replicas succeeded
• ALL: all replicas succeeded
Replicator: Inconsistency
• 5 node cluster
• Replication factor 3
• Consistency level 1
[Figure: nodes N1 N2 N3 N4 N5; Write and Read against replicas N2, N3, N4]
Replicator: Tuning
• Use consistency levels with at least 1 node overlap (Quorum)
• Write CL = 2, Read CL = 2, Replication factor = 3
[Figure: nodes N1 N2 N3 N4 N5; Write and Read against replicas N2, N3, N4]
Replicator: Tuning
• Tune read and write CL separately to reach high performance
• Fast write: Write CL = 1, Read CL = ALL
• Fast read: Write CL = ALL, Read CL = 1
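The "at least 1 node overlap" rule boils down to Write CL + Read CL > RF, and QUORUM is RF / 2 + 1 with integer division. A small sketch of both checks; `ConsistencyMath` is a hypothetical helper, not a driver or server API.

```java
final class ConsistencyMath {
    // QUORUM from the consistency level slide: N/2 + 1 replicas (integer division).
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // A read is guaranteed to see the latest write when the replica sets must share a node.
    static boolean overlaps(int writeCl, int readCl, int replicationFactor) {
        return writeCl + readCl > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3;
        System.out.println("quorum = " + quorum(rf));
        System.out.println("W=1, R=1   -> " + overlaps(1, 1, rf));
        System.out.println("W=2, R=2   -> " + overlaps(2, 2, rf));
        System.out.println("W=1, R=ALL -> " + overlaps(1, rf, rf));
        System.out.println("W=ALL, R=1 -> " + overlaps(rf, 1, rf));
    }
}
```

Both tuning variants (fast write and fast read) keep the overlap; only the combination of CL = 1 on both sides gives it up.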
Storage layer
[Figure: the Client sends a Mutation Request; it is appended to the Commit log (hdd) and added / updated in the MemTable (mem); the MemTable is flushed to an SSTable (hdd), followed by cleanup of the Commit log]
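The write path in the diagram (append, add / update, flush, cleanup) can be modelled with in-memory collections. This is only a toy: `ToyStorageLayer` is a hypothetical class, the "disk" structures are plain lists, and flushing on a fixed entry count is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

final class ToyStorageLayer {
    private final List<String> commitLog = new ArrayList<>();                    // stands in for the append-only file on hdd
    private TreeMap<String, String> memTable = new TreeMap<>();                  // sorted, in memory
    private final List<SortedMap<String, String>> ssTables = new ArrayList<>(); // immutable, "on hdd"
    private final int flushThreshold;

    ToyStorageLayer(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void write(String key, String value) {
        commitLog.add(key + "=" + value); // 1. append to the commit log
        memTable.put(key, value);         // 2. add / update in the MemTable
        if (memTable.size() >= flushThreshold) {
            flush();
        }
    }

    private void flush() {
        ssTables.add(Collections.unmodifiableSortedMap(memTable)); // 3. flush into an immutable SSTable
        memTable = new TreeMap<>();
        commitLog.clear();                                         // 4. cleanup: flushed mutations are durable elsewhere
    }

    int commitLogSize() { return commitLog.size(); }
    int memTableSize() { return memTable.size(); }
    int ssTableCount() { return ssTables.size(); }
}
```

Writes never touch existing SSTables, which is why they stay cheap: the only disk work on the hot path is a sequential append.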
SSTable
• Representation of MemTable
• Immutable
• Eventually get merged into larger SSTable files (compaction)
• Has the following components
  • Bloom filter
  • Index file
  • Data file
SSTable: Bloom filter
• A Bloom filter is used to determine which SSTables may contain a key
• A Bloom filter may return a false positive
• Stored in memory (off-heap since Cassandra 1.2)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 1 0 1 1 1 0 0 1 0 0 1 0
0 0 5 0 0 2 0 1 1 3 0 0 1 0 0 2 0
murmur3(`key`) = 15
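A bit-array Bloom filter like the one pictured above fits in a few lines. This sketch is hypothetical and is not Cassandra's implementation: it uses FNV-1a and `Arrays.hashCode` as stand-ins for murmur3, and derives several bit positions from two base hashes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

final class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    ToyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // FNV-1a, a simple stand-in for the murmur3 hash named on the slide.
    private static int fnv1a(byte[] bytes) {
        int hash = 0x811c9dc5;
        for (byte b : bytes) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int position(String key, int i) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        int h1 = Arrays.hashCode(bytes);
        int h2 = fnv1a(bytes);
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    // false: the key is definitely absent; true: the key may be present (false positives are possible).
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A read can skip every SSTable whose filter answers false, so a false positive costs only one wasted lookup while a negative answer is always reliable.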
SSTable: Index file
• Contains all row keys and their offsets in the data file
• Every 128th key from the file is also kept in memory
• Binary search over the in-memory keys finds the right position in the index file
[Figure: lookup of key 201. In memory: Bloom filter (BF) and sampled index (1, 128, 256, 384). On hdd: index file (…, 126, 127, 128, …, 199, 200, 201, 202, 203, …)]
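The lookup can be sketched as a binary search over the in-memory sample followed by a short scan of the index file. The class `SampledIndex` is hypothetical; an array stands in for the on-disk index file, and the method returns a position in that array rather than a data file offset.

```java
import java.util.Arrays;

final class SampledIndex {
    static final int SAMPLE_INTERVAL = 128;

    private final long[] indexFile; // all keys, sorted; stands in for the index file on hdd
    private final long[] sample;    // every 128th key, kept in memory

    SampledIndex(long[] sortedKeys) {
        this.indexFile = sortedKeys.clone();
        int sampleCount = (indexFile.length + SAMPLE_INTERVAL - 1) / SAMPLE_INTERVAL;
        this.sample = new long[sampleCount];
        for (int i = 0; i < sampleCount; i++) {
            sample[i] = indexFile[i * SAMPLE_INTERVAL];
        }
    }

    // Returns the position of the key in the index file, or -1 if it is absent.
    int find(long key) {
        // Binary search in memory: locate the last sampled key <= key.
        int hit = Arrays.binarySearch(sample, key);
        int slot = (hit >= 0) ? hit : -hit - 2;
        if (slot < 0) {
            return -1; // key is smaller than the first key in the file
        }
        // Short sequential scan "on disk": at most SAMPLE_INTERVAL entries.
        int from = slot * SAMPLE_INTERVAL;
        int to = Math.min(from + SAMPLE_INTERVAL, indexFile.length);
        for (int i = from; i < to && indexFile[i] <= key; i++) {
            if (indexFile[i] == key) {
                return i;
            }
        }
        return -1;
    }
}
```

Keeping only every 128th key trades a bounded scan on disk for an in-memory structure that is 128 times smaller than the full index.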
Compaction
• Merges SSTables
• There are two compaction strategies
  • size-tiered
  • leveled
Compaction: Size-tiered (Minor compaction)
[Figure: SSTables holding keys A to F are merged into larger SSTables; Compaction Level = 2]
Compaction: Size-tiered (Major compaction)
[Figure: all SSTables merged into a single SSTable with keys A, B, C, D, E, F]
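The merge step itself can be sketched as follows: several sorted tables are combined into one, and when a key occurs more than once the value with the newest timestamp wins. `ToyCompaction` and `Cell` are hypothetical names, and tombstones and the choice of which SSTables to merge (the part where size-tiered and leveled differ) are left out.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class ToyCompaction {
    // A stored value together with its write timestamp.
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Merges the given SSTables into one sorted table; the newest cell wins per key.
    static SortedMap<String, Cell> compact(List<SortedMap<String, Cell>> ssTables) {
        TreeMap<String, Cell> merged = new TreeMap<>();
        for (SortedMap<String, Cell> table : ssTables) {
            for (Map.Entry<String, Cell> entry : table.entrySet()) {
                merged.merge(entry.getKey(), entry.getValue(),
                        (current, candidate) -> candidate.timestamp > current.timestamp ? candidate : current);
            }
        }
        return merged;
    }
}
```

Because SSTables are immutable, an update never overwrites the old value in place; compaction is the point where superseded versions are finally dropped.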
Data repair
• Hinted handoff
• Read repair
• Anti-entropy
Coffee break
• General overview
• Data model
• Architecture
• Read & Write operations
Cassandra: Read & Write operations
Write
[Figure: a Write Request arrives at Node1 (Storage Proxy) and is passed to Node2: Commit Log, MemTable, SSTable]
Write: Replication & Consistency
• Replication Factor = 3
• Consistency Level = 2
• Anti-entropy / Read repair
• Hinted handoff
[Figure: nodes N1, N2, N3, N4]
Read: Snitch Function
[Figure: a Read request arrives at Node1 (Storage Proxy); which node to read from is marked with «?»]
Read: Snitch Function
• SimpleSnitch
• DynamicSnitch
• PropertyFileSnitch
• GossipingPropertyFileSnitch
• RackInferringSnitch
…
Read: Snitch functions
SimpleSnitch
[Figure: ring of nodes N1 to N7]
Read: Snitch functions
DynamicSnitch
[Figure: N1 and replicas N1, N2, N3 with latencies 0.6ms, 0.4ms, 0.9ms and ranks 1, 2, 3]
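The idea behind a latency-aware snitch is to prefer the replicas that have been answering fastest. A minimal sketch with a hypothetical helper class; the mapping of the three latencies to N1, N2, N3 used in `main` is an assumption, since the slide layout does not make it clear.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LatencyRanking {
    // Orders replicas from the lowest observed latency to the highest.
    static List<String> rank(Map<String, Double> latencyMillis) {
        List<String> nodes = new ArrayList<>(latencyMillis.keySet());
        nodes.sort(Comparator.comparingDouble((String node) -> latencyMillis.get(node)));
        return nodes;
    }

    public static void main(String[] args) {
        Map<String, Double> observed = new LinkedHashMap<>();
        observed.put("N1", 0.6); // hypothetical assignment of the slide's latencies
        observed.put("N2", 0.4);
        observed.put("N3", 0.9);
        System.out.println(rank(observed));
    }
}
```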
Read: Snitch functions
GossipingPropertyFileSnitch
$CASSANDRA_HOME/conf/cassandra-rackdc.properties
# indicate the rack and dc for this node
dc=DC1
rack=RAC1
Read: In action
• 7 nodes
• RF = 4
• CL = 3
[Figure: a Read request arrives at Node1 (Storage Proxy), which sends one «Read data» request and two «Get digest» requests to replicas in the ring of Node 2 to Node 7]
Read on node
[Figure: SSTables on the node, each marked with B and I]
Almost the end
Nishant Neeraj : «Mastering Apache Cassandra - Second Edition»
Throughput comparison
Publication: A Real Comparison Of NoSQL Databases HBase, Cassandra & MongoDB
https://goo.gl/z5abRu
Summary
The end
Any questions or problems?
Resources
Nishant Neeraj, «Mastering Apache Cassandra - Second Edition», 2015
http://docs.datastax.com/en/cassandra/3.x/cassandra/cassandraAbout.html
Thank you