Cassandra under the hood

Post on 08-Apr-2017


2017

Who I am

Andriy Rymar

Java Software Engineer @ Lohika

More than 7 years of experience

What we won’t

• Learn how to use Cassandra

• Learn about performance tuning

• Learn how to manage a cluster

• Learn how to interact with Cassandra

What we will

We will learn what Cassandra is

Content

• General overview

• Data model

• Architecture

• Read & Write operations

Preface

• RDBMS is not bad

• RDBMS has been successful for the last 40 years

RDBMS: Issues

• Slow queries due to complex joins, and a long time to reindex data

• Expensive vertical scaling and problems with horizontal scaling

• When you try to replicate the database, you hurt the availability of the system

CAP

[Figure: CAP triangle (consistency, availability, partition tolerance) with RDBMS and NoSQL systems placed on it]

CA, CP, AP

• Consistency & Availability

• Consistency & Partition-tolerance

• Availability & Partition-tolerance

Andriy
Consistency & Availability: no partition tolerance. Everything related to a transaction is put on one machine.
Consistency & Partition-tolerance: if all nodes are available, the data is consistent. When nodes fail, some data becomes unavailable.
Availability & Partition-tolerance: there is a risk of producing conflicting results in the case of network failures.
Andriy
Consistency: each request retrieves the latest (correct) state of the whole system (cluster).
Availability: the system has to be always available to serve requests.
Partition tolerance: the system (cluster) can keep operating during network failures.

Eventual consistency

[Figure: an eventually consistent system without any failures; all replicas move from V0 to V1]

[Figure: an eventually consistent system with failures; some replicas still hold V0 for a while before all of them reach V1]

Solution

Google BigTable, 2004

Cassandra, 2008 (2010, 2013)

Amazon DynamoDB, 2012

Cassandra: General Overview

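Cassandra cluster

[Figure: a ring of three nodes N1, N2, N3 with initial tokens A, G, R]

Tokens & Seed node & Ring representation

[Figure: the ring split into the token ranges A - F, G - Q, R - Z]

Tokens determine the position of a node in the ring and the portion of data it owns

Cassandra cluster

[Figure: a write of pk: «Taras», message: «Hello» into the ring N1 (A-F), N2 (G-Q), N3 (R-Z) with Replication Factor (RF) = 2; each node also holds a replica of one other range (G-Q, R-Z, A-F)]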
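The ring lookup described above can be sketched in a few lines. This is only an illustration under loud assumptions: the "token" of a key is taken to be its first letter (as on the slides, whereas real Cassandra hashes the partition key), and extra replicas are assumed to go to the next nodes clockwise on the ring. The names `RING` and `replicas_for` are hypothetical.

```python
from bisect import bisect_right

# Initial tokens from the slides: N1 starts at "A", N2 at "G", N3 at "R".
RING = [("A", "N1"), ("G", "N2"), ("R", "N3")]


def replicas_for(partition_key, replication_factor=2):
    """Return the nodes that store the key, primary replica first."""
    tokens = [token for token, _ in RING]
    # Assumption: the token of a key is simply its upper-cased first letter.
    key_token = partition_key[0].upper()
    # The owner is the node with the largest initial token <= the key token.
    owner = bisect_right(tokens, key_token) - 1
    # Assumption: further replicas are the next nodes clockwise on the ring.
    return [RING[(owner + step) % len(RING)][1]
            for step in range(replication_factor)]


print(replicas_for("Taras"))  # ['N3', 'N1']
```

With RF = 2, the key «Taras» falls into the R-Z range, so under these assumptions it is stored on N3 and replicated to N1.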

Tokens: Issues

• The initial token value has to be managed manually for every node

• Big overhead when restoring node data

# Evenly spaced initial tokens over the range -2**63 .. 2**63 - 1 (Python)
for i in range(CLUSTER_SIZE):
    print((2**64 // CLUSTER_SIZE) * i - 2**63)

[Figure: ring of N1, N2, N3 with Replication Factor (RF) = 2; a new N2 takes the place of N2 and has to restore its data]

Virtual Nodes

[Figure: a ring of 12 virtual nodes (1 to 12) distributed across Server1, Server2, Server3, and Server4]

Virtual Nodes: Data restoring

[Figure: servers S1, S2, S3, S4 with vnode = 3 and RF = 2]

V-nodes: Summary

• Rebalancing a cluster is no longer necessary when adding or removing nodes

• More powerful machines can have more v-nodes. This approach makes it possible to build a heterogeneous Cassandra ring

Andriy
Rebalancing a cluster is no longer necessary when adding or removing nodes. When a node joins the cluster, it assumes responsibility for an even portion of data from the other nodes in the cluster. If a node fails, the load is spread evenly across other nodes in the cluster.
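To see why the load of a failed node spreads over several survivors, here is a small simulation. It is a sketch, not Cassandra's algorithm: vnode tokens are derived from an MD5 hash of a made-up label, and a failed vnode's range is assumed to be inherited by the next vnode on the ring that belongs to another server. `build_ring` and `takeover_after_failure` are hypothetical names.

```python
import hashlib
from collections import Counter


def build_ring(servers, vnodes_per_server):
    """Return a sorted list of (token, server) pairs, one per virtual node."""
    ring = []
    for server in servers:
        for i in range(vnodes_per_server):
            # Assumption: a vnode token is the MD5 hash of a made-up label.
            digest = hashlib.md5(f"{server}-vnode-{i}".encode()).hexdigest()
            ring.append((int(digest, 16), server))
    return sorted(ring)


def takeover_after_failure(ring, failed):
    """Count how many vnode ranges of the failed server each survivor inherits."""
    inherited = Counter()
    size = len(ring)
    for pos, (_, owner) in enumerate(ring):
        if owner != failed:
            continue
        # Walk clockwise to the next vnode owned by a different server.
        step = 1
        while ring[(pos + step) % size][1] == failed:
            step += 1
        inherited[ring[(pos + step) % size][1]] += 1
    return inherited


ring = build_ring(["S1", "S2", "S3", "S4"], vnodes_per_server=3)
print(takeover_after_failure(ring, "S1"))
```

With a single token per server, all of the failed node's data would land on one neighbour. With several vnodes per server, the inherited ranges are usually split between more than one survivor.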

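Cassandra: Data model

Introduction to the data model

[Figure: KEYSPACE > Table (column family) > rows, each identified by a partition key and holding columns (column1, column2, column3) with values; the example rows use keys such as model123 and demo14 and columns such as age, email (test@test.com), and name]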

Andriy Rymar
Note that if you come with Cassandra Thrift experience, it might be hard to view how Cassandra 1.2 and newer versions have changed terminology. Before CQL, the tables were called column families. A column family holds a group of rows, and rows are a sorted set of columns.

Column family

• RDBMS

username   email       title            age
Taras      tm@gm.com   Staff Engineer   -
Andriy     ar@gm.com   -                27

• Column Family

user

key: Taras    value: email: tm@gm.com, title: Staff Engineer

key: Andriy   value: email: ar@gm.com, age: 27

Column family

"user" : {
  "Taras" : {
    "email" : "tm@gm.com",
    "title" : "Staff Engineer"
  },
  "Andriy" : {
    "email" : "ar@gm.com",
    "age" : "27"
  }
}


Other differences

• No relations (No Joins)

• Tuples (key-value pairs) are naturally sorted

• You may want to denormalize the data model in the database

• No transactions

Andriy Rymar
One obvious benefit of having such a flexible data storage mechanism is that you can have an arbitrary number of cells with customized names and have a partition key store data as a list of tuples (a tuple is an ordered set; in this case, the tuple is a key-value pair). This comes in handy when you have to store things such as time series: for example, if you want to use Cassandra to store your Facebook timeline or your Twitter feed, or you want the partition key to be a sensor ID and each cell to represent a tuple with the name as the timestamp when the data was created and the value as the data sent by the sensor. Also, in a partition, cells are by default naturally ordered by the cell's name. So, in our sensor case, you get the data sorted for free.
The other difference is that, unlike an RDBMS, Cassandra does not have relations. This means relational logic needs to be handled at the application level. This also means that we may want to denormalize the database, because there are no joins and we want to avoid looking up multiple tables by running multiple queries. Denormalization is the process of adding redundancy to data to achieve high read performance.

Type of keys

• Primary key

• Composite key

• Partition key

• Clustering key

• Composite partition key

Andriy Rymar
Primary key: This is the column or group of columns that uniquely defines a row of the CQL table.
Composite key: This is a type of primary key that is made up of more than one column. Sometimes, the composite key is also referred to as the compound key.
Partition key: Cassandra's internal data representation is large rows with a unique key called the row key. It uses these row key values to distribute data across cluster nodes. Since these row keys are used to partition data, they are called partition keys. When you define a table with a simple key, that key is the partition key. If you define a table with a composite key, the first term of that composite key works as the partition key. This means all the CQL rows with the same partition key live on one machine.
Clustering key: This is the column that tells Cassandra how the data within a partition is ordered (or clustered). This essentially provides presorted retrieval if you know what order you want your data to be retrieved in.
Composite partition key: Optionally, CQL lets you define a composite partition key (the first part of a composite key). This key helps you distribute data across nodes if any part of the composite partition key differs.

Example 1: Primary key and also the partition key

CREATE TABLE album (
    id uuid,
    name text,
    PRIMARY KEY (id)
);

id - partition & primary key at the same time

Andriy Rymar
There is no clustering. It is a simple key

Example 2: Composite key

CREATE TABLE author_book (
    author text,
    book text,
    population int,
    PRIMARY KEY (author, book)
);

author - partition key; (author, book) - primary key

Andriy Rymar
In the preceding example, we have a composite key that uses author and book to uniquely define a CQL row. The author column is the partition key, so all the rows with the same author will belong to the same node/machine. The rows within a partition will be sorted by the book names.

Example 3: Key with composite partition & clustering keys

CREATE TABLE teacher_lesson (
    teacher text,
    lesson text,
    topic text,
    duration int,
    PRIMARY KEY ((teacher, lesson), topic, duration)
);

(teacher, lesson) - composite partition key; topic, duration - clustering keys

Andriy Rymar
The preceding example has a composite key involving four columns: teacher, lesson, topic, and duration, with teacher and lesson constituting composite partition key. This means the rows with the same teacher but different lesson will be in a different partition. Rows will be ordered by the topic followed by the duration.
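The effect of the composite partition key and the clustering keys from Example 3 can be mimicked with plain Python dictionaries. This is a toy model, not how Cassandra stores data: the `store` function and the sample rows are made up for illustration.

```python
from collections import defaultdict


def store(rows, partition_cols, clustering_cols):
    """Group rows by partition key and keep each partition sorted by clustering key."""
    partitions = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in partition_cols)
        partitions[key].append(row)
    for partition in partitions.values():
        partition.sort(key=lambda r: tuple(r[col] for col in clustering_cols))
    return dict(partitions)


# Hypothetical rows for the teacher_lesson table.
rows = [
    {"teacher": "Ann", "lesson": "math", "topic": "sets", "duration": 45},
    {"teacher": "Ann", "lesson": "math", "topic": "graphs", "duration": 30},
    {"teacher": "Ann", "lesson": "physics", "topic": "optics", "duration": 60},
]
table = store(rows, ["teacher", "lesson"], ["topic", "duration"])

for key, partition in table.items():
    print(key, [r["topic"] for r in partition])
```

The same teacher with a different lesson ends up in a different partition, and rows inside a partition come back ordered by topic and then duration.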

Row vs Partition

[Figure: rows and partitions on Node 1 and Node 2. Rows are shown as numbered cells (1 to 12); partitions are shown as keyed groups: 1234:user, 1234:address, 1234:details and 5678:user, 5678:address, 5678:details]

Coffee break


Cassandra: Architecture

Cassandra components

API Tools

Storage layer

Partitioner Replicator

Failure detector Compaction Manager

Messaging layer

Messaging service

Each node has 2 open socket connections with every other node

In a cluster of 5 nodes, each node therefore has 8 open socket connections

Gossip

Gossip: How does Cassandra initiate sessions?

• One session with a random live node

• One session with a random unreachable node

• If the node in point 1 is not a seed node, then one session with a random seed node

Gossip: Session

1 : N1 -> N2 GossipDigestSynMessage

2 : N2 -> N1 GossipDigestAckMessage

3 : N1 -> N2 GossipDigestAck2Message


Failure detection: ϕ accrual failure detector

• Doesn’t use TRUE / FALSE

• Provides a continuous value

• This value is called «ϕ»

Andriy
ϕthres can be understood like this. Let's say we start to suspect whether a node is dead when ϕ >= ϕthres. When ϕthres is 1, it is equivalent to - log(0.1). The probability that we will make a mistake (that is, the decision that the node is dead will be contradicted in future by a late arriving heartbeat) is 0.1 or 10 percent. Similarly, with ϕthres = 2, the probability of making a mistake goes down to 1 percent; with ϕthres = 3, it drops to 0.1 percent; and so on, following log base 10 formula.
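The relationship between the threshold and the probability of a wrong verdict can be checked numerically. The first function follows the log base 10 rule from the note above. The second one is an assumption added for illustration: it models the gaps between heartbeats as exponentially distributed, which is one possible choice and not something the slides state.

```python
import math


def mistake_probability(phi_threshold):
    """Probability that a node declared dead at this threshold is in fact alive."""
    return 10 ** (-phi_threshold)


def phi(seconds_since_last_heartbeat, mean_interval):
    """phi = -log10(P(heartbeat arrives later than now)).

    Assumption: heartbeat gaps are exponentially distributed, so
    P(later) = exp(-t / mean) and phi simplifies to t / (mean * ln 10).
    """
    return seconds_since_last_heartbeat / (mean_interval * math.log(10))


for threshold in (1, 2, 3):
    print(threshold, mistake_probability(threshold))
```

Under the exponential assumption, ϕ grows linearly with the silence time, so a node that usually answers every second reaches ϕ = 1 after about 2.3 seconds of silence.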

[Figure: ϕ accrual failure detector timeline; horizontal axis: session (1 to 5), vertical axis: time, with intervals of 1s and 2s marked]

Failure detection

Proposed by Xavier Défago in 2004

https://goo.gl/xS0kB0


Partitioner

[Figure: the partitioner distributes all terabytes of data across the nodes N1 to N8]

Partitioner

• Murmur3Partitioner

• RandomPartitioner

• ByteOrderedPartitioner


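Replicator

• Replication factor = 3

[Figure: a write data request arrives at a cluster of N1 N2 N3 N4; the data is replicated to N1 N2 N3]

• Consistency Level = 2

[Figure: the write is acknowledged by N1 N2]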

Consistency level

• ZERO (write only): push and forget

• ANY (write only): success even if only a hint of the write was stored

• ONE: the first replica returned successfully

• QUORUM: N/2 + 1 replicas succeeded

• ALL: all replicas succeeded
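The list above maps each level to a number of replica acknowledgements. A minimal sketch, assuming N in "N/2 + 1" is the replication factor and the division is an integer division; `required_acks` is a hypothetical helper, not a Cassandra API.

```python
def required_acks(level, replication_factor):
    """Number of replica acknowledgements the coordinator waits for."""
    if level in ("ZERO", "ANY"):
        # ZERO does not wait at all; ANY is satisfied by a stored hint,
        # so neither needs an acknowledgement from a real replica.
        return 0
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


for rf in (3, 4, 5):
    print(rf, required_acks("QUORUM", rf))
```

For RF = 3 a quorum is 2 replicas, and for RF = 4 it is 3 replicas.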

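Replicator: Inconsistency

• 5 node cluster

• Replication factor 3

• Consistency level 1

[Figure: nodes N1 N2 N3 N4 N5; the replicas are N2 N3 N4; with consistency level 1 the write and the read can each touch a different replica]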

Replicator: Tuning

• Use consistency levels with at least 1 node of overlap (Quorum)

Write CL = 2, Read CL = 2, Replication factor = 3

[Figure: nodes N1 N2 N3 N4 N5; the replicas are N2 N3 N4; the write set and the read set share at least one replica]

Replicator: Tuning

• Tune read and write CL separately to reach high performance

Fast write: Write CL = 1, Read CL = ALL

Fast read: Write CL = ALL, Read CL = 1
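The overlap rule behind these tuning slides can be written as Write CL + Read CL > Replication factor. The sketch below states the rule and verifies it by brute force over all possible replica sets. Both function names are made up for this example.

```python
from itertools import combinations


def cl_overlap_guaranteed(write_cl, read_cl, replication_factor):
    """The rule of thumb: write and read sets must share at least one replica."""
    return write_cl + read_cl > replication_factor


def always_overlap(write_cl, read_cl, replication_factor):
    """Brute force: does every possible write set intersect every read set?"""
    replicas = range(replication_factor)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, write_cl)
        for read in combinations(replicas, read_cl)
    )


for write_cl, read_cl in [(1, 1), (2, 2), (1, 3), (3, 1)]:
    print(write_cl, read_cl, cl_overlap_guaranteed(write_cl, read_cl, 3))
```

With RF = 3, the combination 1 / 1 can miss the latest write, while 2 / 2, 1 / ALL and ALL / 1 always read at least one replica that has it.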


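Storage layer

[Figure: write path. The client sends a mutation request; it is appended to the commit log (hdd) and added / updated in the MemTable (mem); the MemTable is flushed to an SSTable (hdd), after which the commit log is cleaned up]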

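The write path on the storage layer slide (append, add / update, flush, cleanup) can be modelled with a toy class. Everything here is a simplification: the commit log is an in-memory list standing in for a file, and the flush threshold `memtable_limit` is an invented parameter.

```python
class StorageEngine:
    """Toy model of commit log -> MemTable -> SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stands in for the append-only file on disk
        self.memtable = {}     # in-memory, mutable
        self.sstables = []     # immutable, key-sorted snapshots
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. append to the commit log
        self.memtable[key] = value             # 2. add / update in the MemTable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # 3. flush: the MemTable becomes an immutable, sorted SSTable
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # 4. cleanup: flushed mutations no longer need the commit log
        self.commit_log.clear()


engine = StorageEngine(memtable_limit=2)
engine.write("b", 1)
engine.write("a", 2)   # reaches the limit and triggers a flush
engine.write("a", 3)   # stays in the commit log and the MemTable
print(engine.sstables, engine.memtable, engine.commit_log)
```

Note that the second write to "a" does not modify the existing SSTable. The newer value lives in the MemTable until the next flush.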

SSTable

• On-disk representation of a MemTable

• Immutable

• Eventually gets merged into larger SSTable files (compaction)

• Has the following components:

  • Bloom filter

  • Index file

  • Data file

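SSTable: Bloom filter

• The Bloom filter is used to determine which SSTables may contain the requested key

• The Bloom filter may return a FALSE positive (but never a false negative)

• Stored in off-heap memory (on the heap before Cassandra 1.2)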

SSTable: Bloom filter

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 1 0 0 1 0 1 1 1 0 0 1 0 0 1 0

0 0 5 0 0 2 0 1 1 3 0 0 1 0 0 2 0

murmur3(`key`) = 15
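A Bloom filter like the one on the slide can be sketched as a bit array plus a few hash functions. Assumptions to note: SHA-256 with a seed prefix replaces murmur3 (which is not in the Python standard library), the array has 17 positions to match the rows above, and `hash_count=3` is an arbitrary choice.

```python
import hashlib


class BloomFilter:
    def __init__(self, size=17, hash_count=3):
        self.size = size
        self.hash_count = hash_count
        self.bits = [0] * size

    def _positions(self, key):
        # Assumption: seeded SHA-256 stands in for murmur3.
        for seed in range(self.hash_count):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
bloom.add("Taras")
print(bloom.bits)
print(bloom.might_contain("Taras"))
```

A key that was added is always reported as possibly present. A key that was never added is usually rejected, but may occasionally pass, which is the false positive mentioned above.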

SSTable: Index file

• Contains all row keys and their offsets in the data file

• Every 128th key from the file is kept in memory (the sampled index)

• Binary search over the in-memory sample narrows the lookup to one block of the index file

SSTable: Index file

[Figure: lookup of key 201. In memory: the Bloom filter (BF) and the sampled index (1, 128, 256, 384). On hdd: the index file with entries … 126, 127, 128, … 199, 200, 201, 202, 203, …]
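The lookup of key 201 through the sampled index can be sketched with `bisect`. The index file is modelled as a sorted list of (key, offset) pairs, and the offsets (key * 100) are invented for the example.

```python
from bisect import bisect_right

SAMPLE_EVERY = 128


def build_sample(index_entries):
    """Keep every 128th index entry in memory as (key, position in index file)."""
    return [(key, pos)
            for pos, (key, _) in enumerate(index_entries)
            if pos % SAMPLE_EVERY == 0]


def find_offset(index_entries, sample, key):
    """Binary search the sample, then scan one block of the index file."""
    sample_keys = [k for k, _ in sample]
    slot = bisect_right(sample_keys, key) - 1
    if slot < 0:
        return None                      # key is smaller than every stored key
    start = sample[slot][1]
    for k, offset in index_entries[start:start + SAMPLE_EVERY]:
        if k == key:
            return offset
    return None


# Hypothetical index file: keys 1..499, data-file offset = key * 100.
index_entries = [(k, k * 100) for k in range(1, 500)]
sample = build_sample(index_entries)
print([k for k, _ in sample])
print(find_offset(index_entries, sample, 201))
```

Only the small sample has to live in memory. At most 128 index entries are read from disk per lookup.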


Compaction

• Merges SSTables

• There are two compaction strategies:

  • size-tiered

  • leveled

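Compaction: Size-tiered (Minor compaction)

[Figure: SSTables holding rows A to F are merged in groups of similar size; Compaction Level = 2]

Compaction: Size-tiered (Major compaction)

[Figure: all SSTables are merged into a single SSTable holding A B C D E F]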
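The core of compaction, merging several SSTables into one, can be sketched as follows. This assumes each cell carries a write timestamp and the newest one wins. The data layout (a dict of key to (timestamp, value)) is invented for the example, and tombstones and strategy-specific SSTable selection are left out.

```python
def compact(sstables):
    """Merge SSTables into one, keeping the newest version of every key."""
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # SSTables are stored sorted by key.
    return dict(sorted(merged.items()))


older = {"A": (1, "a1"), "B": (1, "b1")}
newer = {"C": (2, "c2"), "A": (2, "a2")}
print(compact([older, newer]))
```

The result contains each key once, so reads that previously had to check two SSTables now check one.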

Data repair

• Hinted handoff

• Read repair

• Anti-entropy

Coffee break


Cassandra: Read & Write operations

Write

[Figure: a write request arrives at Node1; its Storage Proxy forwards it to Node2, where it passes through the Commit Log, the MemTable, and the SSTable]

Write: Replication & Consistency

[Figure: nodes N1, N2, N3, N4 with Replication Factor = 3 and Consistency Level = 2; replicas that miss the write are brought up to date by hinted handoff or by anti-entropy / read repair]

Read: Snitch function

[Figure: a read request arrives at Node1; its Storage Proxy has to decide which node to ask (?)]

Read: Snitch function

• SimpleSnitch

• DynamicSnitch

• PropertyFileSnitch

• GossipingPropertyFileSnitch

• RackInferringSnitch

Read: Snitch functions

SimpleSnitch

[Figure: nodes N1 to N7]

Read: Snitch functions

DynamicSnitch

[Figure: node N1 and the nodes N1, N2, N3 with measured latencies 0.6ms, 0.4ms, 0.9ms and a resulting order 1, 2, 3]

Read: Snitch functions

GossipingPropertyFileSnitch

$CASSANDRA_HOME/conf/cassandra-rackdc.properties

# indicate the rack and dc for this node
dc=DC1
rack=RAC1

Read: In action

• 7 nodes

• RF = 4

• CL = 3

[Figure: a read request arrives at Node1 (Storage Proxy) in a ring of Node 1 to Node 7; the replicas are Node 2, Node 3, Node 4, Node 5; one replica is asked to read the data and two others are asked for a digest]
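The "read data" plus "get digest" pattern can be sketched like this. It assumes a digest is a hash of the replica's value (MD5 is used here only as an example hash) and that the coordinator simply compares digests. `coordinator_read` is a hypothetical name.

```python
import hashlib


def digest(value):
    """Short fingerprint of a value; cheaper to ship than the value itself."""
    return hashlib.md5(value.encode()).hexdigest()


def coordinator_read(data_replica_value, digest_replica_values):
    """Return (value, consistent) where consistent is False on any digest mismatch."""
    expected = digest(data_replica_value)
    consistent = all(digest(v) == expected for v in digest_replica_values)
    return data_replica_value, consistent


# CL = 3: one full read plus two digest reads.
print(coordinator_read("Hello", ["Hello", "Hello"]))
print(coordinator_read("Hello", ["Hello", "Hi"]))
```

When the digests disagree, the coordinator cannot answer from this round alone. In Cassandra this is the point where the full data is fetched from the replicas and read repair reconciles them.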

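Read on node

[Figure: a read on a node checks several SSTables, each with components labelled B and I (presumably the Bloom filter and the index)]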

Almost the end

Nishant Neeraj : «Mastering Apache Cassandra - Second Edition»

Throughput comparison

Publication: A Real Comparison Of NoSQL Databases HBase, Cassandra & MongoDB

https://goo.gl/z5abRu

Summary

The end

Any questions? Problems?

Resources

Nishant Neeraj, «Mastering Apache Cassandra - Second Edition», 2015

http://docs.datastax.com/en/cassandra/3.x/cassandra/cassandraAbout.html

Thank you