Intro to Cassandra

Tyler Hobbs

History

Cassandra combines:
– Dynamo (clustering)
– BigTable (data model)

Clustering

Every node plays the same role
– No masters, slaves, or special nodes
– No single point of failure

Consistent Hashing

– Key: "www.google.com"
– md5("www.google.com")
– Replication Factor = 3
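The consistent hashing steps above (hash the key with md5, find its place on the ring, store it on Replication Factor nodes) can be sketched in Python. This is a minimal illustration, not Cassandra's implementation: the six evenly spaced node tokens and the function names `token_for` and `replicas_for` are assumptions made for the example, and replicas are simply the next nodes clockwise on the ring.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 128  # md5 produces a 128-bit digest


def token_for(key):
    """Hash a row key to a position on the ring using md5."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def replicas_for(key, node_tokens, replication_factor):
    """Return the tokens of the nodes that hold replicas of `key`.

    The first replica is the first node whose token is >= the key's token
    (wrapping around the ring); the remaining replicas are the next nodes
    clockwise. Assumption: this simple placement, with no rack or
    datacenter awareness.
    """
    ring = sorted(node_tokens)
    start = bisect.bisect_left(ring, token_for(key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]


# Hypothetical six-node cluster with evenly spaced tokens.
nodes = [i * RING_SIZE // 6 for i in range(6)]
owners = replicas_for("www.google.com", nodes, replication_factor=3)
print(owners)  # three consecutive node tokens on the ring
```

Because every node can compute the same answer from the key and the list of tokens, any node can route a request to the right replicas.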

Clustering

– Client can talk to any node

Scaling

RF = 2
– The node at 50 owns the red portion
– Add a new node at 40
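A small Python sketch of what adding the node at 40 does to ownership. The slide's diagram is not recoverable, so the ring of tokens 0, 25, 50, 75 is a hypothetical layout, and `stored_ranges` is an illustrative helper that assumes each node stores its own arc plus the arcs of the RF - 1 nodes before it.

```python
def stored_ranges(token, node_tokens, rf):
    """Return the (start, end] arcs of the ring stored by the node at `token`.

    The first arc is the node's primary range; the others are replicas of
    the ranges owned by the preceding nodes. Requires rf <= number of nodes.
    """
    ring = sorted(node_tokens)
    i = ring.index(token)
    # Negative indexes wrap around the ring for the first nodes.
    return [(ring[i - k - 1], ring[i - k]) for k in range(rf)]


before = [0, 25, 50, 75]  # hypothetical tokens
after = before + [40]     # add a new node at 40

print(stored_ranges(50, before, rf=2))  # [(25, 50), (0, 25)]
print(stored_ranges(50, after, rf=2))   # [(40, 50), (25, 40)]
print(stored_ranges(40, after, rf=2))   # [(25, 40), (0, 25)]
```

In this layout only the neighbors of the new node change what they store; the rest of the ring is untouched.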

Node Failures

RF = 2

Replicas

Consistency, Availability

Consistency
– Can I read stale data?

Availability
– Can I write/read at all?

Tunable Consistency

Consistency

N = Total number of replicas
R = Number of replicas read from
– (before the response is returned)
W = Number of replicas written to
– (before the write is considered a success)

W + R > N gives strong consistency

Consistency

N = 3, W = 2, R = 2

2 + 2 > 3 ==> strongly consistent

Only 2 of the 3 replicas must be available.
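Why W + R > N is enough: every set of R replicas must then share at least one replica with every set of W replicas, so a read always touches a replica that saw the latest successful write. A brute-force Python check of that claim (the function name is illustrative, and replicas are modeled as plain integers):

```python
from itertools import combinations


def reads_always_see_latest_write(n, r, w):
    """True if every possible read set overlaps every possible write set."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


print(reads_always_see_latest_write(3, 2, 2))  # True:  2 + 2 > 3
print(reads_always_see_latest_write(3, 1, 1))  # False: 1 + 1 <= 3
```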

Consistency

Tunable Consistency
– Specify N (Replication Factor) per data set
– Specify R, W per operation
– Quorum: N/2 + 1
• R = W = Quorum
• Strong consistency
• Tolerate the loss of N - Quorum replicas
– R, W can also be 1 or N

Availability

Can tolerate the loss of:
– N - R replicas for reads
– N - W replicas for writes
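The quorum and availability arithmetic is easy to tabulate. In this sketch the division in N/2 + 1 is treated as integer division, and the helper names are illustrative:

```python
def quorum(n):
    """Smallest majority of n replicas: N/2 + 1 with integer division."""
    return n // 2 + 1


def tolerated_losses(n, level):
    """Replicas that can be down while `level` replicas still respond."""
    return n - level


for n in (1, 2, 3, 5):
    q = quorum(n)
    print(f"N={n} quorum={q} tolerates={tolerated_losses(n, q)}")
# N=3 gives quorum=2 and tolerates the loss of 1 replica
```

Reading and writing at quorum always satisfies W + R > N, since two majorities of the same N replicas must overlap.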

CAP Theorem

[Figure: Availability versus Consistency during node or network failure, with regions labeled "Possible" and "Not Possible"; a second version of the diagram marks Cassandra.]

Clustering

No single point of failure
Replication that works
Scales linearly
– 2x nodes = 2x performance
• For both writes and reads
– Up to hundreds of nodes
Operationally simple
Multi-Datacenter Replication

Data Model

Comes from Google BigTable

Goals
– Minimize disk seeks
– High throughput
– Low latency
– Durable

Data Model

Keyspace
– A collection of Column Families
– Controls replication settings

Column Family
– Kinda resembles a table

Column Families

Static
– Object data
– Similar to a table in a relational database

Dynamic
– Pre-calculated query results
– Materialized views

Static Column Families

[Figure: rows keyed by zznate, driftx, thobbs, and jbellis; example columns include password: *, name: Jonathan, site: riptano.com]

Rows
– Each row has a unique primary key
– Sorted list of (name, value) tuples
• Like a sorted map or dictionary
– The (name, value) tuple is called a "column"
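The row structure described above, a unique key mapped to a list of (name, value) columns kept sorted by name, can be modeled with a few lines of Python. The `Row` class is a hypothetical illustration of the idea, not a Cassandra client API, and attaching these columns to the jbellis row is an assumption for the example.

```python
import bisect


class Row:
    """A row: columns kept sorted by name, like a sorted map."""

    def __init__(self):
        self._names = []   # column names, always in sorted order
        self._values = {}  # column name -> value

    def set(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def columns(self):
        return [(name, self._values[name]) for name in self._names]


# A column family maps each row key to its row.
users = {"jbellis": Row()}
users["jbellis"].set("site", "riptano.com")
users["jbellis"].set("password", "*")
users["jbellis"].set("name", "Jonathan")

print(users["jbellis"].columns())
# [('name', 'Jonathan'), ('password', '*'), ('site', 'riptano.com')]
```

Columns come back in name order regardless of insertion order, which is what makes column slices cheap.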

Dynamic Column Families

[Figure: a "Following" column family with rows keyed by zznate, driftx, thobbs, and jbellis; the column names are usernames such as driftx, thobbs, mdennis, zznate, pcmanus, and xedin]

Column Timestamps
– Each column (tuple) has a timestamp
– In the case of a collision, the latest timestamp wins
– Client specifies timestamp with write
– Writes are idempotent
• Infinite retries allowed
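A minimal Python sketch of the timestamp rule: a write only takes effect if its timestamp is newer than the stored one, which is why retrying a write is harmless. The `apply_write` helper and the dict-based row are illustrative; in this sketch a write with an equal timestamp leaves the stored value in place, which is an assumption rather than a statement about Cassandra's tie-breaking.

```python
def apply_write(row, name, value, timestamp):
    """Apply one column write to `row`, a dict of name -> (value, timestamp).

    The write is ignored unless it is newer than what is stored, so
    replaying it or applying it late changes nothing.
    """
    current = row.get(name)
    if current is None or timestamp > current[1]:
        row[name] = (value, timestamp)
    return row


row = {}
apply_write(row, "age", 24, timestamp=100)
apply_write(row, "age", 34, timestamp=200)
apply_write(row, "age", 24, timestamp=100)  # late retry of the older write
print(row)  # {'age': (34, 200)}
```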

Dynamic Column Families

Other Examples:
– Timeline of tweets by a user
– Timeline of tweets by all of the people a user is following
– List of comments sorted by score
– List of friends grouped by state

The Data API

Two choices
– RPC-based API
– CQL
• Cassandra Query Language

Inserting Data

INSERT INTO users (KEY, 'name', 'age') VALUES ('thobbs', 'Tyler', 24);

Updating Data

Updates are the same as inserts:

INSERT INTO users (KEY, 'age') VALUES ('thobbs', 34);

UPDATE users SET 'age' = 34 WHERE KEY = 'thobbs';

Fetching Data

Whole row select:

SELECT * FROM users WHERE KEY = 'thobbs';

Explicit column select:

SELECT 'name', 'age' FROM users WHERE KEY = 'thobbs';

Get a slice of columns:

UPDATE letters SET 1='a', 2='b', 3='c', 4='d', 5='e' WHERE KEY = 'key';

SELECT 1..3 FROM letters WHERE KEY = 'key';

Returns [(1, a), (2, b), (3, c)]

SELECT FIRST 2 FROM letters WHERE KEY = 'key';

Returns [(1, a), (2, b)]

SELECT FIRST 2 REVERSED FROM letters WHERE KEY = 'key';

Returns [(5, e), (4, d)]

SELECT 3..'' FROM letters WHERE KEY = 'key';

Returns [(3, c), (4, d), (5, e)]

SELECT FIRST 2 REVERSED 4..'' FROM letters WHERE KEY = 'key';

Returns [(4, d), (3, c)]
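The slice queries above all follow one rule: walk the sorted columns forward or backward from a start name to an end name, and stop after FIRST n columns. A Python model of that rule, written only to reproduce the results shown above; `slice_columns` and its parameters are hypothetical names, and an empty bound is represented as `None`.

```python
def slice_columns(row, start=None, finish=None, first=None, reverse=False):
    """Return (name, value) pairs from `row` between `start` and `finish`.

    In forward order `start` is the lower bound; with reverse=True the walk
    goes from high to low names, so `start` becomes the upper bound.
    """
    def in_range(name):
        if reverse:
            return ((start is None or name <= start)
                    and (finish is None or name >= finish))
        return ((start is None or name >= start)
                and (finish is None or name <= finish))

    selected = [(n, row[n]) for n in sorted(row, reverse=reverse) if in_range(n)]
    return selected if first is None else selected[:first]


letters = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}

print(slice_columns(letters, start=1, finish=3))              # 1..3
print(slice_columns(letters, first=2))                        # FIRST 2
print(slice_columns(letters, first=2, reverse=True))          # FIRST 2 REVERSED
print(slice_columns(letters, start=3))                        # 3..''
print(slice_columns(letters, first=2, reverse=True, start=4)) # FIRST 2 REVERSED 4..''
```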

Deleting Data

Delete a whole row:

DELETE FROM users WHERE KEY = 'thobbs';

Delete specific columns:

DELETE 'age' FROM users WHERE KEY = 'thobbs';

Secondary Indexes

Built-in basic indexes:

CREATE INDEX ageIndex ON users (age);

SELECT name FROM users WHERE age = 24 AND state = 'TX';

Performance

Writes
– 10k to 30k per second per node
– Sub-millisecond latency

Reads
– 1k to 10k per second per node
– Depends on data set, caching
– Usually 0.1 to 10ms latency

Other Features

Distributed Counters
– Can support millions of high-volume counters

Excellent Multi-datacenter Support
– Disaster recovery
– Locality

Hadoop Integration
– Isolation of resources
– Hive and Pig drivers

Compression

What Cassandra Can't Do

Transactions
– Unless you use a distributed lock
– Atomicity, Isolation
– These aren't needed as often as you'd think

Limited support for ad-hoc queries
– Know what you want to do with the data

Not One-size-fits-all

Use alongside an RDBMS
– Use the RDBMS for highly-transactional or highly-relational data
• Usually a small set of data
– Let Cassandra scale to handle the rest

Language Support

Good:
– Java
– Python
– Ruby
– PHP
– C#

Coming Soon:
– Everything else, now that we have CQL

Questions?

Tyler Hobbs
@tylhobbs
tyler@datastax.com
