Cassandra Day Chicago 2015: Introduction to Apache Cassandra & DataStax Enterprise
Introduction to Apache Cassandra
Introduction to Apache Cassandra
1
Me
Robert Stupp
Freelancer, Coder, Architect
@snazy
Contributor to Apache Cassandra, 3.0 UDFs (CASSANDRA-7395 + related)
Databases, Network, Backend
2
Agenda
Apache Cassandra History
Design Principles
Outstanding differences
CQL Intro
Access C*
Clusters
Cassandra Future
3
Apache Cassandra History
4
Apache Cassandra started at Facebook
inspired by
Note: Facebook initially had two data centers.
5
2.1 released in Sep 2014
6
Apache Cassandra Design Principles
7
Hardware failures can and will occur!
Cassandra handles failures. From single node to whole data center.
From client to server.
8
The complicated part of learning Cassandra
is understanding
Cassandra’s simplicity
9
Keep it simple
all nodes are equal
master-less architecture
no name nodes
no SPOF (single point of failure)
no read before modify (prevents race conditions)
10
Keep it running
No need to take cluster down … e.g.
during maintenance
during software update
Rolling restart is your friend
11
Outstanding Differences
12
Cassandra
Highly scalable
runs on clusters from a few nodes up to 1000+ nodes!
Linear scalability (proven!)
Multi datacenter aware (world-wide!)
No SPOF
13
Cassandra @ Apple
14
Linear Scalability
15
Scaling Cassandra
More data? -> add more nodes
Faster access? -> add more nodes
16
Read / Write performance
Reads are fast
Writes are even faster
17
Durability
Writes are durable - period.
18
Availability @ Netflix
19
Chaos Monkey
kills nodes randomly
Availability @ Netflix
20
Chaos Gorilla
kills regions randomly
Availability @ Netflix
21
Chaos Kong
kills whole data centers
Availability @ Netflix
22
http://de.slideshare.net/planetcassandra/active-active-c-behind-the-scenes-at-netflix
32-node cluster (Raspberry Pis) @ DataStax
23
Most outstanding
Great documentation
Many blog posts
Many presentations
Many videos
Regular webinars
Huge, active and healthy community
24
Data Distribution
25
DHT
Data is organized in a
"Distributed Hash Table"
(hash over row key)
26
DHT
27
[Figure: ring of eight nodes, numbered 0 to 7]
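The "hash over row key" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: eight nodes numbered 0 to 7 as on the ring, one evenly spaced token per node, a 32-bit token space, and MD5 from the standard library as the hash. Cassandra's default partitioner uses Murmur3 and a much larger token range. The names `token_for` and `owner_of` are made up for this sketch.

```python
import hashlib
from bisect import bisect_left

NODES = 8                      # nodes 0..7, as on the ring
RING_SIZE = 2 ** 32            # assumed token space for this sketch

# Assumption: each node owns exactly one token, spread evenly around the ring.
TOKENS = [node * (RING_SIZE // NODES) for node in range(NODES)]


def token_for(row_key: str) -> int:
    """Hash the row key to a position (token) on the ring."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def owner_of(row_key: str) -> int:
    """Return the node holding the first token at or after the key's token."""
    index = bisect_left(TOKENS, token_for(row_key))
    return index % NODES       # wrap around past the last token
```

Because the node is derived from the key alone, any node can compute where a row lives without asking a master.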
Replication
28
Replication Factor 2
29
[Figure: ring of eight nodes, numbered 0 to 7, showing where Row A and Row B are replicated]
Replication Factor 3
30
[Figure: ring of eight nodes, numbered 0 to 7, showing where Row A and Row B are replicated]
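A minimal Python sketch of how a replication factor turns one owning node into several replicas. It assumes the simplest placement rule, namely the owner plus the next nodes clockwise on an eight-node ring, which resembles Cassandra's SimpleStrategy. The function name `replicas_for` is hypothetical, and the nodes it returns are not taken from the slides' figures.

```python
from typing import List

NODES = list(range(8))         # ring order 0..7


def replicas_for(owner: int, replication_factor: int) -> List[int]:
    """Return the owner node plus the following nodes clockwise on the ring."""
    if not 1 <= replication_factor <= len(NODES):
        raise ValueError("replication factor must be between 1 and the node count")
    return [NODES[(owner + step) % len(NODES)]
            for step in range(replication_factor)]
```

With this rule a row owned by node 6 is stored on nodes 6 and 7 for replication factor 2, and on nodes 6, 7 and 0 for replication factor 3.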
Consistency
Consistency defined per request
Several consistency levels (CLs) for different needs
31
Eventual consistency
is not "hopefully consistent"
32
EC means there’s a time gap until updates are consistently readable
Consistency Levels
ANY (only for writes)
ONE, LOCAL_ONE
TWO, THREE (not recommended)
ALL (not recommended)
QUORUM, LOCAL_QUORUM, EACH_QUORUM
SERIAL, LOCAL_SERIAL
33
Consistency
Data is always replicated
CL defines how many replicas must fulfill the request
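How many replicas a consistency level needs can be written down as simple arithmetic. The sketch below covers only ONE, TWO, THREE, QUORUM and ALL for a single data center. The function names are made up for illustration. The overlap check expresses the usual rule that a read is guaranteed to see the latest write when read replicas plus write replicas exceed the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Majority of the replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_replicas(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must answer for the given consistency level."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if consistency_level in fixed:
        needed = fixed[consistency_level]
    elif consistency_level == "QUORUM":
        needed = quorum(replication_factor)
    elif consistency_level == "ALL":
        needed = replication_factor
    else:
        raise ValueError(f"not covered by this sketch: {consistency_level}")
    if needed > replication_factor:
        raise ValueError("consistency level needs more replicas than exist")
    return needed


def read_overlaps_write(write_cl: str, read_cl: str, replication_factor: int) -> bool:
    """True if every read must hit at least one replica that took the write."""
    needed = (required_replicas(write_cl, replication_factor)
              + required_replicas(read_cl, replication_factor))
    return needed > replication_factor
```

For replication factor 3, QUORUM writes combined with QUORUM reads overlap (2 + 2 > 3), while ONE combined with ONE does not (1 + 1 <= 3), which is where the time gap of eventual consistency comes from.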
34
Write
35
[Figure: ring of eight nodes, numbered 0 to 7, with an incoming write]
Write
36
[Figure: ring of eight nodes, numbered 0 to 7, with an incoming write]
Multi DC setup
37
[Figure: DC 1, DC 2]
Multi DC replication
38
[Figure: Write; DC 1, DC 2]
Multi DC replication
39
[Figure: Write; DC 1, DC 2]
Multi DC replication
40
[Figure: Write; DC 1, DC 2]
Replication & Consistency
Define # of replicas using replication factor
Define required consistency per request
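For a multi-DC cluster the same arithmetic applies per data center. The following Python sketch assumes a replication factor given per data center (the names `DC1` and `DC2` and the function `required_acks` are illustrative) and shows how many replicas in which data center must acknowledge a request at LOCAL_QUORUM and EACH_QUORUM.

```python
from typing import Dict


def quorum(replicas: int) -> int:
    """Majority of the given number of replicas."""
    return replicas // 2 + 1


def required_acks(level: str, rf_per_dc: Dict[str, int], local_dc: str) -> Dict[str, int]:
    """Map each data center to the number of replicas that must acknowledge."""
    if level == "LOCAL_QUORUM":
        # Only the coordinator's own data center has to reach a quorum.
        return {local_dc: quorum(rf_per_dc[local_dc])}
    if level == "EACH_QUORUM":
        # Every data center has to reach its own quorum.
        return {dc: quorum(rf) for dc, rf in rf_per_dc.items()}
    raise ValueError(f"not covered by this sketch: {level}")


# Assumed setup: three replicas in each of two data centers.
RF = {"DC1": 3, "DC2": 3}
```

With this setup LOCAL_QUORUM issued in DC1 waits for 2 replicas in DC1 only, so a request does not have to wait for the remote data center, while EACH_QUORUM waits for 2 replicas in both.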
41
CQL Introduction
CQL = Cassandra query language
42
“CQL is SQL minus joins,
minus subqueries, plus collections”
(plus user types, plus tuple types)
43
Why CQL?
Introduces a schema to Cassandra
Familiar syntax
Easy to understand
DML operations are atomic
44
Data model (hierarchical view)
Keyspace (schema)
Table (column family)
Row
partition key (part of primary key)
static columns
clustering key (part of primary key)
columns
45
CQL / DDL
Similar to SQL
CREATE TABLE …
ALTER TABLE …
DROP TABLE …
46
CQL / DML
Similar to SQL
INSERT …
UPDATE …
DELETE …
SELECT …
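A small Python sketch of what DDL and DML look like from application code. The table `sensor_readings` and its columns are invented for illustration. The code only assumes a session object with an `execute(statement, parameters)` method, which is the shape offered by the DataStax Python driver's session. A recording stand-in is used here so the sketch runs without a cluster.

```python
from datetime import datetime

# Hypothetical table: sensor_id is the partition key, reading_time the clustering key.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS sensor_readings (
    sensor_id text,
    reading_time timestamp,
    value double,
    PRIMARY KEY (sensor_id, reading_time)
)
"""
INSERT_READING = ("INSERT INTO sensor_readings (sensor_id, reading_time, value) "
                  "VALUES (%s, %s, %s)")
SELECT_READINGS = "SELECT reading_time, value FROM sensor_readings WHERE sensor_id = %s"


def store_and_read(session, sensor_id, reading_time, value):
    """Create the table, insert one row, and read the partition back by key."""
    session.execute(CREATE_TABLE)
    session.execute(INSERT_READING, (sensor_id, reading_time, value))
    return session.execute(SELECT_READINGS, (sensor_id,))


class RecordingSession:
    """Stand-in for a driver session: records statements instead of sending them."""

    def __init__(self):
        self.statements = []

    def execute(self, statement, parameters=None):
        self.statements.append((" ".join(statement.split()), parameters))
        return []


session = RecordingSession()
rows = store_and_read(session, "sensor-1", datetime(2015, 1, 1), 21.5)
```

Note that the SELECT restricts on the partition key only, in line with access by key.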
47
CQL / BATCH
Group related modifications (INSERT, UPDATE, DELETE)
Atomic operation
48
CQL types
boolean, int (32bit), bigint (64bit),
float, double,
decimal ("BigDecimal"), varint ("BigInteger"),
ascii, text (= varchar), blob,
inet, timestamp, uuid, timeuuid
49
CQL collection types
list < foo >
set < foo >
map < foo , bar >
50
Since C* 2.1 collections can contain
any type - even other collections.
CQL composite types
user types (C* 2.1) are composite types with named fields
tuple types (C* 2.1) are unstructured lists of values
51
CQL / user types
CREATE TYPE address (
    street text,
    zip int,
    city text
);
CREATE TABLE users (
    username text,
    addresses map<text, frozen<address>>,
    ...
52
Cassandra Data Modeling
Access by key
no access by arbitrary WHERE clause
Duplicate data (it’s ok!)
Aggregate data
Build application-maintained indexes
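The points above can be illustrated with plain Python dictionaries standing in for tables. This is a toy model, not driver code: the names `videos_by_user`, `videos_by_tag` and `add_video` are invented, and each dictionary plays the role of one table whose key is the partition key.

```python
from collections import defaultdict

# One "table" per question the application asks.
videos_by_user = defaultdict(list)   # "Which videos did this user upload?"
videos_by_tag = defaultdict(list)    # "Which videos carry this tag?"


def add_video(user, title, tags):
    """Write the same video into every table that needs it."""
    video = {"user": user, "title": title, "tags": tuple(tags)}
    videos_by_user[user].append(video)
    for tag in tags:
        # Duplicated on purpose: this acts as an application-maintained index.
        videos_by_tag[tag].append(video)


add_video("alice", "Intro to C*", ["cassandra", "nosql"])
add_video("bob", "CQL basics", ["cassandra"])
```

Each read is then a single lookup by key, for example `videos_by_tag["cassandra"]`, at the cost of writing the data more than once.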
53
RDBMS modeling
54
C* modeling
55
Data Modeling with RDBMS
Driven by
"How can I store something right?"
"What answers do I have?"
56
Data Modeling with NoSQL
Driven by
"How can I access something right?"
"What questions do I have?"
57
Data Modeling Basics
Work top-down. Think about:
What does the application do?
What are the access patterns?
Now design the data model
58
Data Modeling
59
http://de.slideshare.net/planetcassandra/cassandra-day-sv-2014-fundamentals-of-apache-cassandra-data-modeling
http://de.slideshare.net/planetcassandra/data-modeling-with-travis-price
Accessing Cassandra
60
Command Line
cqlsh: CQL shell
nodetool: node/cluster administration
61
GUI: DevCenter
Visual query tool
62
Stress test?
Cassandra 2.1 comes with an improved stress tool
Simulate read+write workload
Uses configurable data
Works against older C* versions, too
63
DataStax APLv2 Open Source Drivers
for Java
for Python
for C#
for Scala / Spark
https://github.com/datastax/ or http://www.datastax.com/download
64
Native protocol
C*’s own net protocol for clients
Request multiplexing
Schema change notifications
Cluster change notifications
65
Third Party Drivers
for a huge number of languages
66
Mappers
High level mappers exist at least for Java
Special case: Scala, due to its strong + complex type model (DataStax OSS Spark driver)
67
Spark + Hadoop
Yes - works really well
Note: Spark is advertised as up to 100x faster than Hadoop MapReduce
68
Clusters
69
Cluster sizes
C* works with a few nodes
C* works with several hundred / thousand nodes
70
Cluster setup
Configure for multiple data centers
Plan for multi-DC setup :)
71
Cluster experience
Remember: A single Cassandra cluster works across multiple data centers all over the world
"Disaster proven"
Hurricanes
Amazon DC outages
72
Apache Cassandra Future
73
Cassandra 3.0 (in development)
User Defined Functions
Aggregate functions
Functional indexes
Workload recording + playback
Better SSTables, Fully off-heap row cache, Better serial consistency
Indexes w/ high cardinality
74
Subject to
change!!!
Get active!
75
Cassandra Community
http://cassandra.apache.org/
http://planetcassandra.org/ - Blog
http://www.slideshare.net/planetcassandra/presentations
http://de.slideshare.net/DataStax/presentations
76
Cassandra Community
https://www.youtube.com/user/PlanetCassandra
https://www.youtube.com/user/DataStax
http://www.datastax.com/dev/blog/
http://www.datastax.com/docs/
Users Mailing List [email protected]
77
Free C* Training!
http://planetcassandra.org/cassandra-training/
78
Get involved!
Ask questions, submit RFEs or experiences to
user mailing list
Answers arrive quickly!
79
Live Demo
User Defined Functions
80
C* 3.0 UDFs
Users create functions using CREATE FUNCTION … LANGUAGE … AS …
Java, JavaScript, Scala, Groovy, JRuby, Jython
Functions work on all nodes
81
C* 3.0 UDFs
Example
CREATE FUNCTION sin(input double) RETURNS double LANGUAGE javascript AS 'Math.sin(input)';
82
This is JavaScript!
UDFs for what?
Own aggregation code - e.g. SELECT sum(value) FROM table WHERE …;
Functional indexes - e.g. CREATE INDEX idx ON table ( myFunction(colname) );
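One way to picture "own aggregation code" is a state function that is folded over the selected rows, followed by a final function that turns the accumulated state into the result. The Python sketch below computes an average this way. It is an assumption about how such an aggregate could be structured, not the demo shown in the talk, and the function names are illustrative.

```python
from functools import reduce


def state_function(state, value):
    """Called once per row: keeps a running (sum, row count) pair."""
    total, count = state
    return (total + value, count + 1)


def final_function(state):
    """Called once at the end: turns the state into the aggregate result."""
    total, count = state
    return total / count if count else None


def aggregate(values, initial_state=(0.0, 0)):
    """Fold the state function over all values, then apply the final function."""
    return final_function(reduce(state_function, values, initial_state))
```

Because the state is small and updated row by row, an aggregate of this shape never needs to hold all rows in memory at once.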
83
Targeted for C* 3.0
Thanks for your attention
Robert Stupp@[email protected]/RobertStupp
Download Apache Cassandra at http://cassandra.apache.org/
84
Q & A
85
86
BACKUP SLIDES
User-Defined-Functions
Demo
87
88
89
90
91
92
93
94
95
96
97
98
99