Cassandra under the hood


Page 1: Cassandra under the hood

Cassandra under the hood

2017

Page 2: Cassandra under the hood

Who I am

Andriy Rymar

Java Software Engineer @ Lohika

More than 7 years of experience

Page 3: Cassandra under the hood

What we won’t do

• Learn how to use Cassandra

• Learn about performance tuning

• Learn how to manage a cluster

• Learn how to interact with Cassandra

Page 4: Cassandra under the hood

What we will do

We will learn what Cassandra is

Page 5: Cassandra under the hood

Content

• General overview

• Data model

• Architecture

• Read & Write operations

Page 6: Cassandra under the hood

Preface

Page 7: Cassandra under the hood

RDBMS

• RDBMS is not bad

• RDBMS has been successful for the last 40 years

Page 8: Cassandra under the hood

RDBMS issues

• Slow queries due to complex joins; reindexing data takes a long time

• Vertical scaling is expensive and horizontal scaling is problematic

• Replicating the database hurts the availability of the system

Page 9: Cassandra under the hood

CAP

[Diagram: the three CAP properties (consistency, availability, partition tolerance) with RDBMS and NoSQL placed among them]

Page 10: Cassandra under the hood

CA, CP, AP

• Consistency & Availability

• Consistency & Partition-tolerance

• Availability & Partition-tolerance

Andriy
- Consistency & Availability: no partition tolerance. Everything related to a transaction is put on one machine.
- Consistency & Partition-tolerance: if all nodes are available, the data is consistent. When nodes fail, some data becomes unavailable.
- Availability & Partition-tolerance: there is a risk of producing conflicting results in the case of network failures.
Andriy
- Consistency: each request retrieves the latest (correct) state of the whole system (cluster).
- Availability: the system has to be always available to serve requests.
- Partition tolerance: the system (cluster) can keep operating during network failures.
Page 11: Cassandra under the hood

Eventual consistency

Eventually consistent system without any failures

Eventually consistent system with failures

[Diagrams: replicas changing from value V0 to V1 over time in each of the two scenarios]

Page 12: Cassandra under the hood

Solution

Google BigTable, 2004

Cassandra, 2008 (2010, 2013)

Amazon DynamoDB, 2012

Page 13: Cassandra under the hood

Cassandra: General Overview

Page 14: Cassandra under the hood

Cassandra cluster

Tokens & Seed node & Ring representation

[Ring diagram: nodes N1, N2, N3 with tokens A, G, R and data ranges A - F, G - Q, R - Z]

Tokens determine the position of a node in the ring and the portion of data it is responsible for

Page 15: Cassandra under the hood

Cassandra cluster

[Ring diagram: N1 (A-F), N2 (G-Q), N3 (R-Z); each node additionally holds a replica of another node's range]

pk: «Taras», message: «Hello»

Replication Factor (RF) = 2
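The ring lookup on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: letters stand in for real numeric tokens, the class name `TokenRingSketch` and the method `replicasFor` are hypothetical, and replicas are placed on the next nodes clockwise, which is how a simple (non-topology-aware) replication strategy behaves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative sketch only: letters stand in for real (numeric) tokens.
class TokenRingSketch {
    // token (first letter of the owned range) -> node, as on the slides
    static final TreeMap<Character, String> RING = new TreeMap<>();
    static {
        RING.put('A', "N1"); // owns A-F
        RING.put('G', "N2"); // owns G-Q
        RING.put('R', "N3"); // owns R-Z
    }

    // Returns the nodes that store a partition key: the owner of the range
    // containing the key, then the next nodes clockwise on the ring.
    static List<String> replicasFor(String partitionKey, int replicationFactor) {
        char first = Character.toUpperCase(partitionKey.charAt(0));
        Character token = RING.floorKey(first); // largest token <= first letter
        if (token == null) {
            token = RING.lastKey(); // wrap around the ring
        }
        int copies = Math.min(replicationFactor, RING.size());
        List<String> replicas = new ArrayList<>();
        while (replicas.size() < copies) {
            replicas.add(RING.get(token));
            Character next = RING.higherKey(token);
            token = (next != null) ? next : RING.firstKey();
        }
        return replicas;
    }

    public static void main(String[] args) {
        // pk «Taras» falls into R-Z, so N3 owns it and N1 holds the second copy.
        System.out.println(replicasFor("Taras", 2));
    }
}
```

With RF = 2 every key lives on two neighbouring nodes, which matches the two range labels shown next to each node in the diagram.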

Page 16: Cassandra under the hood

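Tokens: issues

• Initial token values have to be managed manually for all nodes

• Big overhead when restoring node data

import java.math.BigInteger;
for (int i = 0; i < CLUSTER_SIZE; i++) {
    // initial token of node i: (2^64 / CLUSTER_SIZE) * i - 2^63
    BigInteger step = BigInteger.ONE.shiftLeft(64).divide(BigInteger.valueOf(CLUSTER_SIZE));
    System.out.println(step.multiply(BigInteger.valueOf(i)).subtract(BigInteger.ONE.shiftLeft(63)));
}

[Diagram: ring of N1, N2, N3 with Replication Factor (RF) = 2, where N2 is replaced by a new N2]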

Page 17: Cassandra under the hood

Virtual Nodes

[Diagram: ring split into 12 virtual nodes (1 to 12) distributed across Server1, Server2, Server3 and Server4]

Page 18: Cassandra under the hood

Virtual Nodes: data restoring

vnode = 3, RF = 2

[Diagram: servers S1, S2, S3, S4]

Page 19: Cassandra under the hood

V-nodes: summary

• Rebalancing a cluster is no longer necessary when adding or removing nodes

• More powerful machines can have more v-nodes. This makes it possible to build a heterogeneous Cassandra ring

Andriy
Rebalancing a cluster is no longer necessary when adding or removing nodes. When a node joins the cluster, it assumes responsibility for an even portion of data from the other nodes in the cluster. If a node fails, the load is spread evenly across other nodes in the cluster.
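A small sketch can make the "load is spread evenly" claim concrete. It assumes 12 vnodes and 4 servers as on the earlier slide; the interleaving of owners in `OWNERS` is hand-picked for the example (real tokens are assigned randomly), and `VnodeSketch` and `takeoverServers` are hypothetical names.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: which servers take over when one server fails.
class VnodeSketch {
    // 12 vnodes around the ring, 3 per server; position i is owned by OWNERS[i].
    // The interleaving below is an assumption made for this example.
    static final String[] OWNERS = {
        "S1", "S2", "S3", "S4", "S1", "S3", "S2", "S4", "S1", "S4", "S2", "S3"
    };

    // For each vnode of the failed server, find the next vnode clockwise that
    // belongs to a different server; that server takes over the range.
    static Set<String> takeoverServers(String failed) {
        Set<String> result = new TreeSet<>();
        for (int i = 0; i < OWNERS.length; i++) {
            if (!OWNERS[i].equals(failed)) {
                continue;
            }
            int j = (i + 1) % OWNERS.length;
            while (OWNERS[j].equals(failed)) {
                j = (j + 1) % OWNERS.length;
            }
            result.add(OWNERS[j]);
        }
        return result;
    }

    public static void main(String[] args) {
        // All three remaining servers share the load of the failed S1.
        System.out.println(takeoverServers("S1"));
    }
}
```

With a single token per server, the same failure would push the whole range onto one neighbour; with several vnodes per server the work is split between multiple servers.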
Page 20: Cassandra under the hood

Cassandra: Data model

Page 21: Cassandra under the hood

Introduction to the data model

KEYSPACE

Table (column family)

[Diagram: rows of a table; each row has a partition key followed by columns and their values (for example column1, column2, column3 in one row and name, age in another)]

Andriy Rymar
Note that if you come with Cassandra Thrift experience, it might be hard to view how Cassandra 1.2 and newer versions have changed terminology. Before CQL, the tables were called column families. A column family holds a group of rows, and rows are a sorted set of columns.
Page 22: Cassandra under the hood

Column family

• RDBMS

username | email           | title          | age
Taras    | [email protected] | Staff Engineer |
Andriy   | [email protected] |                | 27

• Column Family

user

key: Taras    value: email: [email protected], title: Staff Engineer

key: Andriy   value: email: [email protected], age: 27

Page 23: Cassandra under the hood

Column family

"user" : {
  "Taras" : {
    "email" : "[email protected]",
    "title" : "Staff Engineer"
  },
  "Andriy" : {
    "email" : "[email protected]",
    "age" : "27"
  }
}

user

key: Taras    value: email: [email protected], title: Staff Engineer

key: Andriy   value: email: [email protected], age: 27

Page 24: Cassandra under the hood

Other differences

• No relations (No Joins)

• Tuples (key-value pairs) are naturally sorted

• You may want to denormalize the data model in the database

• No transactions

Andriy Rymar
One obvious benefit of such a flexible data storage mechanism is that you can have an arbitrary number of cells with customized names, so a partition key can store data as a list of tuples (a tuple is an ordered set; in this case, a key-value pair). This comes in handy when you have to store things such as time series: for example, a Facebook timeline or a Twitter feed, or sensor data where the partition key is the sensor ID and each cell is a tuple whose name is the timestamp at which the data was created and whose value is the data sent by the sensor. Within a partition, cells are by default naturally ordered by cell name, so in the sensor case you get the data sorted for free.
The other difference is that, unlike an RDBMS, Cassandra does not have relations. Relational logic therefore has to be handled at the application level. Because there are no joins, we may also want to denormalize the database to avoid looking up multiple tables with multiple queries. Denormalization is the process of adding redundancy to the data to achieve high read performance.
Page 25: Cassandra under the hood

Type of keys

• Primary key

• Composite key

• Partition key

• Clustering key

• Composite partition key

Andriy Rymar
Primary key: This is the column or group of columns that uniquely defines a row of the CQL table.
Composite key: This is a type of primary key that is made up of more than one column. Sometimes, the composite key is also referred to as the compound key.
Partition key: Cassandra's internal data representation is large rows with a unique key called the row key. It uses these row key values to distribute data across cluster nodes. Since these row keys are used to partition data, they are called partition keys. When you define a table with a simple key, that key is the partition key. If you define a table with a composite key, the first term of that composite key works as the partition key. This means all the CQL rows with the same partition key live on one machine.
Clustering key: This is the column that tells Cassandra how the data within a partition is ordered (or clustered). This essentially provides presorted retrieval if you know in what order you want your data to be retrieved.
Composite partition key: Optionally, CQL lets you define a composite partition key (the first part of a composite key). This key helps you distribute data across nodes if any part of the composite partition key differs.
Page 26: Cassandra under the hood

Example 1

CREATE TABLE album (
  id uuid,
  name text,
  PRIMARY KEY (id)
);

id is the primary key and, at the same time, the partition key

Andriy Rymar
There is no clustering. It is a simple key
Page 27: Cassandra under the hood

Composite key

Example 2

CREATE TABLE author_book (
  author text,
  book text,
  population int,
  PRIMARY KEY (author, book)
);

author: partition key; (author, book): primary key

Andriy Rymar
In the preceding example, we have a composite key that uses author and book to uniquely define a CQL row. The author column is the partition key, so all the rows with the same author will belong to the same node/machine. The rows within a partition will be sorted by the book names.
Page 28: Cassandra under the hood

Example 3: key with composite partition & clustering keys

CREATE TABLE teacher_lesson (
  teacher text,
  lesson text,
  topic text,
  duration int,
  PRIMARY KEY ((teacher, lesson), topic, duration)
);

(teacher, lesson): composite partition key; topic, duration: clustering keys

Andriy Rymar
The preceding example has a composite key involving four columns: teacher, lesson, topic, and duration, with teacher and lesson constituting composite partition key. This means the rows with the same teacher but different lesson will be in a different partition. Rows will be ordered by the topic followed by the duration.
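How a partition key and a clustering key shape the stored data can be modelled with nested maps. The sketch below mirrors the author_book table from Example 2 and is only a mental model, not Cassandra's storage code; the class `PartitionSketch`, its methods and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: partition key -> rows sorted by clustering key.
class PartitionSketch {
    // author (partition key) -> book (clustering key) -> population
    final Map<String, TreeMap<String, Integer>> partitions = new HashMap<>();

    // All rows with the same author end up in the same partition.
    void insert(String author, String book, int population) {
        partitions.computeIfAbsent(author, a -> new TreeMap<>()).put(book, population);
    }

    // Rows inside a partition come back already ordered by the clustering key.
    List<String> booksOf(String author) {
        TreeMap<String, Integer> rows = partitions.getOrDefault(author, new TreeMap<>());
        return new ArrayList<>(rows.keySet());
    }

    public static void main(String[] args) {
        PartitionSketch table = new PartitionSketch();
        table.insert("author-a", "book-z", 10);
        table.insert("author-a", "book-b", 20);
        table.insert("author-c", "book-m", 30);
        System.out.println(table.booksOf("author-a")); // sorted by book name
    }
}
```

A composite partition key, as in Example 3, would correspond to using the pair (teacher, lesson) as the outer map key.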
Page 29: Cassandra under the hood

Row vs Partition

[Diagram: partitions 1234, 5678 and 9101112 spread over Node 1 and Node 2; each partition holds several rows, for example 1234:user, 1234:address, 1234:details and 5678:user, 5678:address, 5678:details]

Page 30: Cassandra under the hood

Coffee break

• General overview

• Data model

• Architecture

• Read & Write operations

Page 31: Cassandra under the hood

Cassandra: Architecture

Page 32: Cassandra under the hood

Cassandra components

• API

• Tools

• Storage layer

• Partitioner

• Replicator

• Failure detector

• Compaction Manager

• Messaging layer

Page 34: Cassandra under the hood

Messaging service

Each node has 2 open socket connections with every other node

In a cluster of 5 nodes, each node therefore has 2 × 4 = 8 open socket connections

Page 35: Cassandra under the hood

Gossip

Page 36: Cassandra under the hood

Gossip: how does Cassandra initiate sessions?

• One session with a random live node

• One session with a random unreachable node

• If the node chosen in the first step is not a seed node, one more session with a random seed node

Page 37: Cassandra under the hood

Gossip: session

N1 ↔ N2

1 : GossipSyncMessage

2 : GossipAckMessage

3 : GossipAck2Message
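The three messages can be read as a digest exchange. The sketch below is a simplified model, not Cassandra's gossip code: each node keeps one version number per endpoint (assumed non-negative), and `GossipSketch` and `gossipWith` are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: one gossip session between two nodes.
class GossipSketch {
    // endpoint -> latest version of that endpoint's state known locally
    final Map<String, Integer> state = new HashMap<>();

    // Runs one three-message session initiated by this node towards peer.
    void gossipWith(GossipSketch peer) {
        // 1: Sync - the initiator sends digests (endpoint and version only).
        Map<String, Integer> digests = new HashMap<>(state);

        // 2: Ack - the peer returns states that are newer on its side and
        //    asks for the endpoints where the initiator is ahead.
        Map<String, Integer> newerOnPeer = new HashMap<>();
        Set<String> requested = new HashSet<>();
        Set<String> endpoints = new HashSet<>(digests.keySet());
        endpoints.addAll(peer.state.keySet());
        for (String endpoint : endpoints) {
            int mine = digests.getOrDefault(endpoint, -1);
            int theirs = peer.state.getOrDefault(endpoint, -1);
            if (theirs > mine) {
                newerOnPeer.put(endpoint, theirs);
            } else if (mine > theirs) {
                requested.add(endpoint);
            }
        }
        state.putAll(newerOnPeer);

        // 3: Ack2 - the initiator sends the states the peer asked for.
        for (String endpoint : requested) {
            peer.state.put(endpoint, digests.get(endpoint));
        }
    }

    public static void main(String[] args) {
        GossipSketch n1 = new GossipSketch();
        GossipSketch n2 = new GossipSketch();
        n1.state.put("N1", 5);
        n2.state.put("N2", 7);
        n1.gossipWith(n2);
        System.out.println(n1.state.equals(n2.state)); // both views agree
    }
}
```

After one session both nodes hold the newest version of every endpoint either of them knew about, which is why a few rounds of random sessions are enough to spread state through the cluster.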

Page 39: Cassandra under the hood

Failure detection: ϕ accrual failure detector

• Doesn’t use TRUE / FALSE

• Provides a continuous value

• This value is called «ϕ»

Andriy
ϕthres can be understood like this. Let's say we start to suspect whether a node is dead when ϕ >= ϕthres. When ϕthres is 1, it is equivalent to - log(0.1). The probability that we will make a mistake (that is, the decision that the node is dead will be contradicted in future by a late arriving heartbeat) is 0.1 or 10 percent. Similarly, with ϕthres = 2, the probability of making a mistake goes down to 1 percent; with ϕthres = 3, it drops to 0.1 percent; and so on, following log base 10 formula.
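The relation between ϕ and the probability of a mistake in the note above can be written down directly. The sketch assumes exponentially distributed heartbeat intervals, so the probability that a heartbeat still arrives after `elapsed` is exp(-elapsed / mean); this distribution, the class `PhiAccrualSketch` and its method names are assumptions for illustration.

```java
// Illustrative sketch only: phi = -log10(P(heartbeat arrives later than now)).
class PhiAccrualSketch {
    // Under the exponential assumption P = exp(-elapsed / mean), so
    // phi = -log10(P) = elapsed / (mean * ln 10).
    static double phi(double elapsedMillis, double meanIntervalMillis) {
        return elapsedMillis / (meanIntervalMillis * Math.log(10.0));
    }

    // Probability that declaring the node dead at this phi is a mistake.
    static double mistakeProbability(double phi) {
        return Math.pow(10.0, -phi);
    }

    // The node is suspected once phi reaches the configured threshold.
    static boolean isSuspected(double phi, double threshold) {
        return phi >= threshold;
    }

    public static void main(String[] args) {
        double mean = 1000.0; // hypothetical mean heartbeat interval in ms
        for (double elapsed : new double[] {1000.0, 5000.0, 20000.0}) {
            double value = phi(elapsed, mean);
            System.out.println(elapsed + " ms -> phi " + value
                    + ", mistake probability " + mistakeProbability(value));
        }
    }
}
```

Because ϕ grows continuously with the silence of a node, the threshold can be tuned per deployment instead of relying on a fixed timeout.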
Page 40: Cassandra under the hood

Failure detection: ϕ accrual failure detector

[Chart: gossip sessions 1 to 5 on a time axis, with intervals of 1s and 2s between sessions]

Page 41: Cassandra under the hood

Failure detection

Proposed by Xavier Défago in 2004

https://goo.gl/xS0kB0

Page 43: Cassandra under the hood

Partitioner

[Diagram: all terabytes of data split across nodes N1 to N8]

Page 44: Cassandra under the hood

Partitioner

• Murmur3Partitioner

• RandomPartitioner

• ByteOrderedPartitioner

Page 46: Cassandra under the hood

Replicator

• Replication factor = 3: the write data request is stored on N1, N2, N3 (cluster: N1, N2, N3, N4)

• Consistency Level = 2: the write waits for acknowledgements from N1, N2

Page 47: Cassandra under the hood

Consistency level

• ZERO (write only): push and forget

• ANY (write only): success even if only a hint was written

• ONE: first replica returned successfully

• QUORUM: N/2 + 1 replicas succeeded (N is the replication factor)

• ALL: all replicas succeeded

Page 48: Cassandra under the hood

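Inconsistency

• 5-node cluster

• Replication factor 3

• Consistency level 1

[Diagram: nodes N1 to N5 with replicas N2, N3, N4; with consistency level 1 the Write and the Read may each touch a different single replica, so the read can miss the write]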

Page 49: Cassandra under the hood

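Tuning

• Use consistency levels with at least 1 node of overlap between reads and writes (Quorum)

Replication factor = 3, Write CL = 2, Read CL = 2

[Diagram: nodes N1 to N5 with replicas N2, N3, N4; the Write and the Read each touch two replicas, so they share at least one]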

Page 50: Cassandra under the hood

Tuning

• Tune read and write CL separately to reach high performance

• Fast write: Write CL = 1, Read CL = ALL

• Fast read: Write CL = ALL, Read CL = 1
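The overlap rule behind these tuning slides fits in two small functions. This is a sketch of the arithmetic only; `ConsistencySketch`, `quorum` and `readsSeeLatestWrite` are hypothetical names, and consistency levels are given as the number of replicas that must respond.

```java
// Illustrative sketch only: quorum size and the read/write overlap rule.
class ConsistencySketch {
    // QUORUM = N/2 + 1, where N is the replication factor (integer division).
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // The replicas touched by a write and by a later read share at least one
    // node exactly when writeCl + readCl > replicationFactor.
    static boolean readsSeeLatestWrite(int replicationFactor, int writeCl, int readCl) {
        return writeCl + readCl > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3;
        System.out.println("quorum: " + quorum(rf));
        System.out.println("W=1, R=1: " + readsSeeLatestWrite(rf, 1, 1));
        System.out.println("W=2, R=2: " + readsSeeLatestWrite(rf, 2, 2));
        System.out.println("W=1, R=ALL: " + readsSeeLatestWrite(rf, 1, rf));
        System.out.println("W=ALL, R=1: " + readsSeeLatestWrite(rf, rf, 1));
    }
}
```

The inconsistent setup (RF 3, CL 1 for both reads and writes) fails the check, while the quorum setup and both "fast write" and "fast read" combinations pass it.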

Page 52: Cassandra under the hood

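Storage layer

[Diagram: the Client sends a Mutation Request; it is appended to the Commit log (hdd) and added / updated in the MemTable (mem); the MemTable is flushed to an SSTable (hdd), followed by cleanup]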
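The write path in this diagram can be sketched with three in-memory structures. This is a toy model under stated assumptions, not Cassandra's implementation: lists and maps stand in for files, the flush threshold of 2 entries is hypothetical and tiny on purpose, and `StorageLayerSketch` and its members are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative sketch only: commit log, MemTable and SSTables of one node.
class StorageLayerSketch {
    static final int FLUSH_THRESHOLD = 2; // hypothetical, tiny for the demo

    final List<String> commitLog = new ArrayList<>();                 // append-only ("hdd")
    TreeMap<String, String> memTable = new TreeMap<>();               // sorted ("mem")
    final List<TreeMap<String, String>> ssTables = new ArrayList<>(); // immutable ("hdd")

    void write(String key, String value) {
        commitLog.add(key + "=" + value); // append to the commit log first
        memTable.put(key, value);         // then add / update in the MemTable
        if (memTable.size() >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    // The MemTable is written out as a new SSTable; the commit log entries
    // it covered are no longer needed (cleanup).
    void flush() {
        ssTables.add(memTable);
        memTable = new TreeMap<>();
        commitLog.clear();
    }

    // Reads check the MemTable first, then SSTables from newest to oldest.
    String read(String key) {
        if (memTable.containsKey(key)) {
            return memTable.get(key);
        }
        for (int i = ssTables.size() - 1; i >= 0; i--) {
            String value = ssTables.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        StorageLayerSketch node = new StorageLayerSketch();
        node.write("a", "1");
        node.write("b", "2"); // reaches the threshold and triggers a flush
        node.write("a", "3"); // newer value stays in the MemTable
        System.out.println(node.read("a") + " " + node.read("b"));
    }
}
```

Writes never modify an existing SSTable: an update simply lands in a newer structure, and reads resolve to the newest value.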

Page 54: Cassandra under the hood

SSTable

• On-disk representation of a flushed MemTable

• Immutable

• Eventually gets merged into larger SSTable files (compaction)

• Has the following components:

  • Bloom filter

  • Index file

  • Data file

Page 55: Cassandra under the hood

SSTable: Bloom filter

• A Bloom filter is used to determine which SSTables may contain a key

• A Bloom filter may return a false positive (but never a false negative)

• Stored in memory (off-heap since Cassandra 1.2)

Page 56: Cassandra under the hood

SSTable: Bloom filter

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 1 0 0 1 0 1 1 1 0 0 1 0 0 1 0

0 0 5 0 0 2 0 1 1 3 0 0 1 0 0 2 0

murmur3(`key`) = 15
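A Bloom filter over a 17-position bit array, as drawn above, can be sketched as follows. The number of hash functions (3) and the use of `String.hashCode` in place of murmur3 are assumptions made to keep the example dependency-free; `BloomFilterSketch` and its methods are hypothetical names.

```java
import java.util.BitSet;

// Illustrative sketch only: a tiny Bloom filter with a 17-bit array.
class BloomFilterSketch {
    static final int SIZE = 17;  // number of bits, as on the slide
    static final int HASHES = 3; // hypothetical number of hash functions

    final BitSet bits = new BitSet(SIZE);

    // i-th bit position for a key; String.hashCode stands in for murmur3.
    static int position(String key, int i) {
        return Math.floorMod((key + "#" + i).hashCode(), SIZE);
    }

    void add(String key) {
        for (int i = 0; i < HASHES; i++) {
            bits.set(position(key, i));
        }
    }

    // false: the key was definitely never added.
    // true: the key was probably added (false positives are possible).
    boolean mightContain(String key) {
        for (int i = 0; i < HASHES; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch();
        filter.add("key");
        System.out.println(filter.mightContain("key")); // true: no false negatives
    }
}
```

On a read, a node can skip every SSTable whose filter answers false and only pay the disk cost for the ones that answer true.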

Page 57: Cassandra under the hood

SSTable: Index file

• Contains all row keys and their offsets in the data file

• Every 128th key from the file is kept in memory

• Binary search over the in-memory sample finds the right position in the index file

Page 58: Cassandra under the hood

SSTable: Index file

memory: BF, Sampled Index (1, 128, 256, 384)

hdd: Index file (… 126,… 127,… 128,… … 199,… 200,… 201,… 202,… 203,…)

Lookup of key 201
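The lookup of key 201 can be traced with a binary search over the sampled keys. The sketch uses the sample values from the slide; integer keys, the class `SampledIndexSketch` and the method `scanStart` are simplifications and hypothetical names.

```java
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: find where to start scanning the on-disk index.
class SampledIndexSketch {
    // Every 128th key of the index file, kept in memory (values from the slide).
    static final List<Integer> SAMPLED = List.of(1, 128, 256, 384);

    // Returns the sampled key from which the scan of the index file starts:
    // the largest sampled key that is less than or equal to the searched key.
    static int scanStart(int key) {
        int pos = Collections.binarySearch(SAMPLED, key);
        if (pos >= 0) {
            return SAMPLED.get(pos); // the key itself is in the sample
        }
        int insertionPoint = -pos - 1;
        if (insertionPoint == 0) {
            throw new IllegalArgumentException("key is below the first sampled key");
        }
        return SAMPLED.get(insertionPoint - 1);
    }

    public static void main(String[] args) {
        // Key 201 lies between the samples 128 and 256, so the index file is
        // read from the entry for 128 until 201 is found.
        System.out.println(scanStart(201));
    }
}
```

Only the short sample lives in memory; at most 127 index entries after the starting point have to be read from disk.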

Page 60: Cassandra under the hood

Compaction

• Merges SSTables

• Compaction strategies include:

  • size-tiered

  • leveled

Page 61: Cassandra under the hood

Compaction

Size-tiered (Minor compaction)

Compaction Level = 2

[Diagram: SSTables holding keys A to F being merged step by step into larger SSTables]

Page 62: Cassandra under the hood

Compaction

Size-tiered (Major compaction)

[Diagram: all SSTables merged into one holding A, B, C, D, E, F]

Page 63: Cassandra under the hood

Data repair

• Hinted handoff

• Read repair

• Anti-entropy

Page 64: Cassandra under the hood

Coffee break

• General overview

• Data model

• Architecture

• Read & Write operations

Page 65: Cassandra under the hood

Cassandra: Read & Write operations

Page 66: Cassandra under the hood

Write

[Diagram: a Write Request arrives at Node1 (Storage Proxy) and is forwarded to Node2, where it goes through Commit Log, MemTable and SSTable]

Page 67: Cassandra under the hood

Write: Replication & Consistency

Replication Factor = 3, Consistency Level = 2

[Diagram: nodes N1, N2, N3, N4, annotated with Anti-entropy / Read repair and Hinted handoff]

Page 68: Cassandra under the hood

Read

[Diagram: a Read request arrives at Node1 (Storage Proxy), which consults the Snitch Function to decide which node to read from]

Page 69: Cassandra under the hood

Read: Snitch function

• SimpleSnitch

• DynamicSnitch

• PropertyFileSnitch

• GossipingPropertyFileSnitch

• RackInferringSnitch

Page 70: Cassandra under the hood

Read: Snitch functions

SimpleSnitch

[Diagram: nodes N1 to N7]

Page 71: Cassandra under the hood

Read: Snitch functions

DynamicSnitch

[Diagram: nodes N1, N2, N3 with measured latencies 0.6ms, 0.4ms, 0.9ms and ranks 1, 2, 3]

Page 72: Cassandra under the hood

Read: Snitch functions

GossipingPropertyFileSnitch

$CASSANDRA_HOME/conf/cassandra-rackdc.properties

# indicate the rack and dc for this node
dc=DC1
rack=RAC1

Page 73: Cassandra under the hood

Read: in action

• 7 nodes • RF = 4 • CL = 3

[Diagram: a read request arrives at Node1 (Storage Proxy); the replicas are Node 2, Node 3, Node 4, Node 5; the coordinator sends "Read data" to one replica and "Get digest" to two others]
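The "Read data" plus "Get digest" pattern can be sketched as a digest comparison on the coordinator. This is a simplified model: replica values are passed in as plain strings, MD5 is used as the digest function for the example, and `ReadPathSketch`, `digest` and `digestsMatch` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: coordinator compares one full read with digests.
class ReadPathSketch {
    static byte[] digest(String value) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // replicaValues: values held by the CL replicas picked by the snitch,
    // closest first. The first replica returns full data, the others only a
    // digest. Returns true when every digest matches the data.
    static boolean digestsMatch(List<String> replicaValues) {
        byte[] dataDigest = digest(replicaValues.get(0)); // "Read data"
        for (String other : replicaValues.subList(1, replicaValues.size())) {
            if (!Arrays.equals(dataDigest, digest(other))) { // "Get digest"
                return false; // replicas disagree: the data needs repairing
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // CL = 3: one data read and two digest reads.
        System.out.println(digestsMatch(List.of("Hello", "Hello", "Hello")));
        System.out.println(digestsMatch(List.of("Hello", "Hello", "Hi")));
    }
}
```

Sending digests instead of full rows keeps the network cost of a CL = 3 read close to that of a single read when the replicas agree.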

Page 74: Cassandra under the hood

Read on node

[Diagram: several SSTables on a node, each marked with B and I (presumably its Bloom filter and index)]

Page 75: Cassandra under the hood

Almost the end

Page 76: Cassandra under the hood

Throughput comparison

Nishant Neeraj: «Mastering Apache Cassandra - Second Edition»

Page 77: Cassandra under the hood

Publication: A Real Comparison Of NoSQL Databases HBase, Cassandra & MongoDB

https://goo.gl/z5abRu

Page 78: Cassandra under the hood

Summary

Page 79: Cassandra under the hood

The end

Page 80: Cassandra under the hood

Any questions? Problems?

Page 81: Cassandra under the hood

Resources

Nishant Neeraj, «Mastering Apache Cassandra - Second Edition», 2015

http://docs.datastax.com/en/cassandra/3.x/cassandra/cassandraAbout.html

Thank you