Introduction to Apache Cassandra


Me

Robert Stupp

Freelancer, Coder, Architect

@snazy [email protected]

Contributor to Apache Cassandra, 3.0 UDFs (CASSANDRA-7395 + related)

Databases, Network, Backend


Agenda

Apache Cassandra History

Design Principles

Outstanding differences

CQL Intro

Access C*

Clusters

Cassandra Future


Apache Cassandra History


Apache Cassandra started at Facebook

inspired by Amazon Dynamo and Google Bigtable

Note: Facebook initially had two data centers.


2.1 released in Sep 2014


Apache Cassandra Design Principles


Hardware failures can and will occur!

Cassandra handles failures. From single node to whole data center.

From client to server.


The complicated part when learning Cassandra is to understand Cassandra’s simplicity


Keep it simple

all nodes are equal

master-less architecture

no name nodes

no SPOF (single point of failure)

no read before modify (prevent race conditions)


Keep it running

No need to take cluster down … e.g.

during maintenance

during software update

Rolling restart is your friend


Outstanding Differences


Cassandra

Highly scalable

runs on anything from a few nodes up to 1000+ node clusters!

Linear scalability (proven!)

Multi datacenter aware (world-wide!)

No SPOF


Cassandra @ Apple


Linear Scalability


Scaling Cassandra

More data? -> add more nodes

Faster access? -> add more nodes


Read / Write performance

Reads are fast

Writes are even faster


Durability

Writes are durable - period.


Availability @ Netflix


Chaos Monkey

kills nodes randomly



Chaos Gorilla

kills regions randomly



Chaos Kong

kills whole data centers




http://de.slideshare.net/planetcassandra/active-active-c-behind-the-scenes-at-netflix

32-node cluster (Raspberry Pis) @ DataStax


Most outstanding

Great documentation

Many blog posts

Many presentations

Many videos

Regular webinars

Huge, active and healthy community


Data Distribution


DHT

Data is organized in a "Distributed Hash Table" (hash over row key)
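To make "hash over row key" concrete, here is a minimal sketch of a hash ring. It is an illustration under assumptions, not Cassandra's implementation: the token range `RING_SIZE`, the evenly spaced tokens and the helper names are hypothetical, and MD5 is used only because it ships with Python's standard library.

```python
import bisect
import hashlib
from typing import List

RING_SIZE = 2**32  # hypothetical token range, chosen for readability


def token_for(row_key: str) -> int:
    """Hash a row key to a token (a position on the ring)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % RING_SIZE


def build_ring(node_count: int) -> List[int]:
    """Assign each node one evenly spaced token; node i owns tokens up to ring[i]."""
    step = RING_SIZE // node_count
    return [(i + 1) * step - 1 for i in range(node_count)]


def owner_of(row_key: str, ring: List[int]) -> int:
    """Return the index of the first node whose token is >= the key's token."""
    index = bisect.bisect_left(ring, token_for(row_key))
    return index % len(ring)  # wrap around past the last token


ring = build_ring(8)
for key in ("Row A", "Row B"):
    print(key, "-> node", owner_of(key, ring))
```

Because the owner is derived from the key alone, any node can compute where a row lives without asking a master.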


DHT

[Figure: DHT diagram with eight nodes labeled 0 to 7]

Replication


Replication Factor 2

[Figure: eight nodes labeled 0 to 7, with Row A and Row B each stored on two nodes]

Replication Factor 3

[Figure: eight nodes labeled 0 to 7, with Row A and Row B each stored on three nodes]
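The replication factor can be sketched as "the owning node plus the next nodes on the ring". The function below assumes a simple clockwise placement on an eight-node ring; `replicas_for` is a hypothetical helper, and real placement depends on the configured replication strategy.

```python
from typing import List


def replicas_for(primary_node: int, replication_factor: int, node_count: int = 8) -> List[int]:
    """Return the primary node followed by its clockwise neighbours."""
    if not 1 <= replication_factor <= node_count:
        raise ValueError("replication factor must be between 1 and the node count")
    return [(primary_node + i) % node_count for i in range(replication_factor)]


print(replicas_for(1, 2))  # [1, 2]
print(replicas_for(6, 3))  # [6, 7, 0]
```

The second call shows the wrap-around: a row owned by node 6 with replication factor 3 also lands on nodes 7 and 0.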

Consistency

Consistency defined per request

Several consistency levels (CLs) for different needs


Eventual consistency is not "hopefully consistent"


EC means there’s a time gap until updates are consistently readable

Consistency Levels

ANY (only for writes)

ONE, LOCAL_ONE

TWO, THREE (not recommended)

ALL (not recommended)

QUORUM, LOCAL_QUORUM, EACH_QUORUM

SERIAL, LOCAL_SERIAL


Consistency

Data is always replicated

CL defines how many replicas must fulfill the request
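How many replicas a consistency level needs can be worked out from the replication factor. The sketch below covers only the single-data-center levels and leaves out ANY and the SERIAL levels; the function names are hypothetical. It also encodes the usual rule of thumb that a read sees the latest write when the read and write replica counts together exceed the replication factor.

```python
def required_replicas(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge a request at the given level."""
    quorum = replication_factor // 2 + 1  # strict majority of the replicas
    levels = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum,
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def read_sees_latest_write(write_cl: str, read_cl: str, replication_factor: int) -> bool:
    """True when the read and write replica sets are guaranteed to overlap."""
    writes = required_replicas(write_cl, replication_factor)
    reads = required_replicas(read_cl, replication_factor)
    return writes + reads > replication_factor


print(required_replicas("QUORUM", 3))                  # 2
print(read_sees_latest_write("QUORUM", "QUORUM", 3))   # True
print(read_sees_latest_write("ONE", "ONE", 3))         # False
```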


Write

[Figure: two slides, each showing a write against the eight nodes labeled 0 to 7]

Multi DC setup

[Figure: two data centers, DC 1 and DC 2]

Multi DC replication

[Figure: three slides showing a write being replicated across DC 1 and DC 2]

Replication & Consistency

Define # of replicas using replication factor

Define required consistency per request
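With more than one data center, the replication factor is set per data center and the quorum variants differ in which replicas they count. The sketch below assumes two data centers named `DC1` and `DC2` with three replicas each; the names and the `required_acks` helper are hypothetical.

```python
from typing import Dict


def quorum(replica_count: int) -> int:
    """Strict majority of the given number of replicas."""
    return replica_count // 2 + 1


def required_acks(consistency_level: str, rf_per_dc: Dict[str, int], local_dc: str) -> Dict[str, int]:
    """Map each scope to the number of replica acknowledgements it needs.

    The key "*" means the acknowledgements may come from any data center.
    """
    if consistency_level == "LOCAL_QUORUM":
        return {local_dc: quorum(rf_per_dc[local_dc])}
    if consistency_level == "EACH_QUORUM":
        return {dc: quorum(rf) for dc, rf in rf_per_dc.items()}
    if consistency_level == "QUORUM":
        return {"*": quorum(sum(rf_per_dc.values()))}
    raise ValueError("unsupported consistency level: " + consistency_level)


replication = {"DC1": 3, "DC2": 3}
print(required_acks("LOCAL_QUORUM", replication, "DC1"))  # {'DC1': 2}
print(required_acks("EACH_QUORUM", replication, "DC1"))   # {'DC1': 2, 'DC2': 2}
print(required_acks("QUORUM", replication, "DC1"))        # {'*': 4}
```

LOCAL_QUORUM keeps the request inside the local data center, which is why it is a common choice when the data centers are far apart.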


CQL Introduction

CQL = Cassandra Query Language


“CQL is SQL minus joins, minus subqueries, plus collections”

(plus user types, plus tuple types)


Why CQL?

Introduces a schema to Cassandra

Familiar syntax

Easy to understand

DML operations are atomic


Data model (hierarchical view)

Keyspace (schema)

Table (column family)

Row

partition key (part of primary key)

static columns

clustering key (part of primary key)

columns
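The hierarchy above can be pictured as nested maps. The classes below are a hypothetical in-memory model for illustration only, not a Cassandra API: a keyspace holds tables, a table maps partition keys to partitions, and a partition holds static columns plus rows ordered by clustering key.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Partition:
    static_columns: Dict[str, Any] = field(default_factory=dict)  # shared by all rows
    rows: Dict[Tuple, Dict[str, Any]] = field(default_factory=dict)  # clustering key -> columns

    def sorted_rows(self) -> List[Dict[str, Any]]:
        """Rows of one partition, ordered by clustering key."""
        return [self.rows[key] for key in sorted(self.rows)]


@dataclass
class Table:
    partitions: Dict[Tuple, Partition] = field(default_factory=dict)  # partition key -> partition

    def upsert(self, partition_key: Tuple, clustering_key: Tuple, columns: Dict[str, Any]) -> None:
        """Insert a row, or update it if the full primary key already exists."""
        partition = self.partitions.setdefault(partition_key, Partition())
        partition.rows.setdefault(clustering_key, {}).update(columns)


@dataclass
class Keyspace:
    tables: Dict[str, Table] = field(default_factory=dict)


keyspace = Keyspace()
events = keyspace.tables.setdefault("events", Table())  # hypothetical table name
events.upsert(("alice",), (2,), {"text": "second"})
events.upsert(("alice",), (1,), {"text": "first"})
print([row["text"] for row in events.partitions[("alice",)].sorted_rows()])  # ['first', 'second']
```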


CQL / DDL

Similar to SQL

CREATE TABLE …

ALTER TABLE …

DROP TABLE …


CQL / DML

Similar to SQL

INSERT …

UPDATE …

DELETE …

SELECT …


CQL / BATCH

Group related modifications (INSERT, UPDATE, DELETE)

Atomic operation


CQL types

boolean, int (32bit), bigint (64bit),

float, double,

decimal ("BigDecimal"), varint ("BigInteger"),

ascii, text (= varchar), blob,

inet, timestamp, uuid, timeuuid


CQL collection types

list < foo >

set < foo >

map < foo , bar >


Since C* 2.1 collections can contain any type - even other collections.

CQL composite types

user types (C* 2.1) are composite types with named fields

tuple types (C* 2.1) are unstructured lists of values


CQL / user types

CREATE TYPE address (
    street text,
    zip int,
    city text
);

CREATE TABLE users (
    username text,
    addresses map<text, frozen<address>>,
    ...


Cassandra Data Modeling

Access by key

no access by arbitrary WHERE clause

Duplicate data (it’s ok!)

Aggregate data

Build application maintained indexes
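"Duplicate data" and "application maintained indexes" can be sketched with two lookup tables that the application writes together. Everything here is hypothetical (the table names, the fields and the in-memory dicts standing in for tables); the point is only that each query gets its own table keyed by what that query knows.

```python
from typing import Dict, Optional

# One "table" per access pattern, each keyed by the value the query starts from.
users_by_name: Dict[str, dict] = {}
users_by_email: Dict[str, dict] = {}


def save_user(username: str, email: str, city: str) -> None:
    """Write the same user twice, once per lookup key (duplication is intended)."""
    user = {"username": username, "email": email, "city": city}
    users_by_name[username] = dict(user)
    users_by_email[email] = dict(user)


def find_by_name(username: str) -> Optional[dict]:
    return users_by_name.get(username)


def find_by_email(email: str) -> Optional[dict]:
    return users_by_email.get(email)


save_user("alice", "alice@example.org", "Cologne")
print(find_by_email("alice@example.org")["username"])  # alice
```

Both lookups are plain key accesses, so neither needs an arbitrary WHERE clause.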


RDBMS modeling


C* modeling


Data Modeling with RDBMS

Driven by

"How can I store something right?"

"What answers do I have?"


Data Modeling with NoSQL

Driven by

"How can I access something right?"

"What questions do I have?"


Data Modeling Basics

Work top-down. Think about:

What does the application do?

What are the access patterns?

Now design data model


Data Modeling


http://de.slideshare.net/planetcassandra/cassandra-day-sv-2014-fundamentals-of-apache-cassandra-data-modeling

http://de.slideshare.net/planetcassandra/data-modeling-with-travis-price


Accessing Cassandra


Command Line

cqlsh: CQL shell

nodetool: node/cluster administration


GUI: DevCenter

Visual query tool


Stress test?

Cassandra 2.1 comes with an improved stress tool

Simulate read+write workload

Uses configurable data

Works against older C* versions, too


DataStax APLv2 Open Source Drivers

for Java

for Python

for C#

for Scala / Spark

https://github.com/datastax/ or http://www.datastax.com/download


Native protocol

C*’s own net protocol for clients

Request multiplexing

Schema change notifications

Cluster change notifications


Third Party Drivers

for a huge number of languages


Mappers

High level mappers exist at least for Java

Special case: Scala, due to its strong + complex type model (DataStax OSS Spark driver)


Spark + Hadoop

Yes - works really well

Note: Spark is advertised as up to 100x faster than Hadoop MapReduce for in-memory workloads


Clusters


Cluster sizes

C* works with a few nodes

C* works with several hundred / thousand nodes


Cluster setup

Configure for multiple data centers

Plan for multi-DC setup :)


Cluster experience

Remember: A single Cassandra cluster works over multiple data centers all over the world

"Disaster proven"

Hurricanes

Amazon DC outages


Apache Cassandra Future


Cassandra 3.0 (in development)

User Defined Functions

Aggregate functions

Functional indexes

Workload recording + playback

Better SSTables, Fully off-heap row cache, Better serial consistency

Indexes w/ high cardinality


Subject to change!!!

Get active!


Cassandra Community

http://cassandra.apache.org/

http://planetcassandra.org/ - Blog

http://www.slideshare.net/planetcassandra/presentations

http://de.slideshare.net/DataStax/presentations


Cassandra Community

https://www.youtube.com/user/PlanetCassandra

https://www.youtube.com/user/DataStax

http://www.datastax.com/dev/blog/

http://www.datastax.com/docs/

Users Mailing List [email protected]



Free C* Training!

http://planetcassandra.org/cassandra-training/


Get involved!

Ask questions, submit RFEs or share experiences on the user mailing list [email protected]

Answers arrive quickly!



Live Demo

User Defined Functions


C* 3.0 UDFs

Users create functions using CREATE FUNCTION … LANGUAGE … AS …

Java, JavaScript, Scala, Groovy, JRuby, Jython

Functions work on all nodes


C* 3.0 UDFs

Example

CREATE FUNCTION sin(input double) RETURNS double LANGUAGE javascript AS 'Math.sin(input)';


This is JavaScript!

UDFs for what?

Own aggregation code - e.g. SELECT sum(value) FROM table WHERE …;

Functional indexes - e.g. CREATE INDEX idx ON table ( myFunction(colname) );
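"Own aggregation code" can be thought of as a fold: a state function is applied to every value and an optional final function turns the last state into the result. The sketch below shows that idea in plain Python. It is not the C* 3.0 syntax, which the slides mark as subject to change, and all names are hypothetical.

```python
from functools import reduce
from typing import Any, Callable, Iterable


def aggregate(values: Iterable[Any],
              state_func: Callable[[Any, Any], Any],
              initial_state: Any,
              final_func: Callable[[Any], Any] = lambda state: state) -> Any:
    """Fold all values into a state, then convert the state into the result."""
    return final_func(reduce(state_func, values, initial_state))


def avg_state(state, value):
    """Carry (count, total) from one value to the next."""
    count, total = state
    return (count + 1, total + value)


def avg_final(state):
    """Turn (count, total) into an average; None for an empty input."""
    count, total = state
    return total / count if count else None


print(aggregate([1, 2, 3], lambda total, value: total + value, 0))  # 6
print(aggregate([2.0, 4.0], avg_state, (0, 0.0), avg_final))        # 3.0
```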


Targeted for C* 3.0

Thanks for your attention

Robert Stupp

@snazy [email protected]

Download Apache Cassandra at http://cassandra.apache.org/



http://de.slideshare.net/RobertStupp


Q & A


BACKUP SLIDES

User-Defined-Functions

Demo
