BigData Developers MeetUp

52
Christian Johannsen, Solutions Engineer Evaluating Apache Cassandra as a Cloud Database

description

BigData Developers MeetUp in Berlin at IBM

Transcript of BigData Developers MeetUp

Page 1: BigData Developers MeetUp

Christian Johannsen, Solutions Engineer

Evaluating Apache Cassandra as a Cloud Database

Page 2: BigData Developers MeetUp

About me

2

Christian JohannsenSolutions Engineer @ DataStax

Twitter: @cjohannsen81

Page 3: BigData Developers MeetUp

What is a cloud database not?

3

• A Cloud database is not simply taking a traditional RDBMS and

running it in a Cloud provider’s environment

Page 4: BigData Developers MeetUp

What is a cloud database?

4

• A cloud database is a database optimised to handle the

infrastructure limitations and features of running on

shared, geographically dispersed infrastructure.

• Some of them are: Outages, noisy neighbours, security,

variable networks etc…

Page 5: BigData Developers MeetUp

Key attributes of a cloud database

5

• Transparent Elasticity

• being able to add and subtract nodes (physical or virtual machines) with load balancing when the underlying application and business demands it.

• Transparent Scalability

• addition of nodes increases both (1) performance throughput; (2) ability to

handle Big Data and maintain high performance

• High Availability

• always up; no single point of failure

• Easy Data Distribution

• able to span multiple geographies, data centers, and cloud provider

zones. Can read/write to any node

Page 6: BigData Developers MeetUp

6

Key attributes of a cloud database

• Support for multiple data types

• Able to handle structured, semi-structured, and unstructured data.

• Easy to manage

• easy to administer a logical database across many nodes

• Lower Cost

• It shouldn’t break the bank and you shouldn’t have to have a massive upfront cost

• Multiple Infrastructure Support

• It should support being deployed on multiple cloud providers (public and

private) as well as traditional infrastructure

• Security

• Data should be secure in flight, at rest and audited.

Page 7: BigData Developers MeetUp

Key Attributes at a glance

7

1. Transparent Elasticity

2. Transparent Scalability

3. High Availability

4. Easy Data Distribution

5. Data Redundancy

6. Support for multiple data types

7. Easy to manage

8. Lower Cost

9. Multiple Infrastructure Support

10. Security

Page 8: BigData Developers MeetUp

Apache Cassandra

8

Page 9: BigData Developers MeetUp

Apache Cassandra

9

• Cassandra is a massively scalable, open source, NoSQL, distributed database built for modern, mission-critical online applications.

• Written in Java and is a hybrid of Amazon Dynamo and Google BigTable

• Masterless with no single point of failure

• Distributed, data centre and rack aware

• 100% uptime

• Predictable scaling Dynamo

BigTable

BigTable: http://research.google.com/archive/bigtable-osdi06.pdf

Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

Page 10: BigData Developers MeetUp

Apache Cassandra Overview

10

• Cassandra was designed with the understanding that system/hardware failures

can and do occur

• Peer-to-peer, distributed system

• All nodes the same - masterless with no single point of failure

• Data partitioned among all nodes in the cluster

• Custom data replication to ensure fault tolerance

• Tunable data consistency

• Data Center and Rack aware

• Familiar SQL-Like language – CQL

• More capacity? Add a server!

• More throughput? Add a server!

Page 11: BigData Developers MeetUp

Cassandra Adoption

11

Page 12: BigData Developers MeetUp

Cassandra - Locally Distributed

12

• Replication factor (RF): How many copies of your data?

• RF = 3 in this example

• Each node is storing 60% of the clusters total data i.e. 3/5

Handy Calculator: http://www.ecyrd.com/cassandracalculator/

Node 1

1st copy

Node 4

Node 5

Node 2

2nd copy

Node 3

3rd copy

• Client reads or writes to any node

• Node coordinates with others

• Data read or replicated in parallel

Node 2

2nd copy

Page 13: BigData Developers MeetUp

Cassandra - Rack/Zone aware

13

Node 1

1st copy

Node 4

Node 2

Node 3

2nd copy

Rack 1

Rack 2Rack 2

Rack 3

Rack 1

Node 5

3rd copy

• Cassandra is aware of which rack or zone each nod resides in

• It will attempt to place each data copy in a different rack

• RF=3 in this example

Page 14: BigData Developers MeetUp

Cassandra - DC/Region aware

14

• Active Everywhere – reads/writes in multiple data centres

• Client writes local

• Data syncs across WAN

• Replication Factor per DC

• Different number of nodes per

data center

Node 1

1st copy

Node 4

Node 5Node 2

2nd copy

Node 3

3rd copy

Node 1

1st copy

Node 4

Node 5Node 2

2nd copy

Node 3

3rd copy

DC: EUROPEDC: USA

Page 15: BigData Developers MeetUp

Cassandra - Tuneable Consistency

15

• Consistency Level (CL)

• Client specifies per operation

• Handles multi-data center operations

ALL = All replicas ack

QUORUM = > 51% of replicas ack

LOCAL_QUORUM = > 51% in local DC ack

ONE = Only one replica acks

Plus more…. (see docs)

Blog: Eventual Consistency != Hopeful Consistency

http://planetcassandra.org/blog/post/a-netflix-experiment-eventual-consistency-hopeful-

consistency-by-christos-kalantzis/

Node 1

1st copy

Node 4

Node 5Node 2

2nd copy

Node 3

3rd copy

Parallel

Write

Write

CL=QUORUM

5 μs ack

12 μs ack

500 μs ack

12 μs ack

Page 16: BigData Developers MeetUp

Cassandra - Node failure

16

• A single node failure shouldn’t bring failure.

• Replication Factor + Consistency Level = Success

• This example:

• RF = 3

• CL = QUORUM

Node 1

1st copy

Node 4

Node 5Node 2

2nd copy

Node 3

3rd copy

Parallel

Write

Write

CL=QUORUM

5 μs ack

12 μs ack

12 μs ack

>51% ack – so request is a success

Page 17: BigData Developers MeetUp

Cassandra - Node Recovery

17

• When a write is performed and a replica node for the row is unavailable the

coordinator will store a hint locally (3 hours)

• When the node recovers, the coordinator replays the missed writes.

• Note: a hinted write does not count the consistency level

• Note: you should still run repairs across your cluster

Node 1

1st copy

Node 4

Node 5Node 2

2nd copy

Node 3

3rd copy

Stores Hints while Node 3 is offline

Page 18: BigData Developers MeetUp

Cassandra Rack/Zone Failure

18

• Cassandra will place the data in as many

different racks or availability zones as it can.

• This example:

• RF = 3

• CL = QUORUM

• AZ/Rack 2 fails

• Data copies still available in Node 1 and

Node 5

• Quorum can be honored i.e. > 51% ack

Node 1

1st copy

Node 4

Node 2

Node 3

2nd copy

Rack 1

Rack 2Rack 2

Rack 3

Rack 1

Node 5

3rd copy

request is a success

Page 19: BigData Developers MeetUp

Operational Simplicity

19

• Cassandra is a complete product – there is not a multitude

of components to install, set-up and monitor.

• Extremely simple to administer and deploy

• Backups are instantaneous and simple to restore• Supports snapshots, incremental backups and point-in-time recovery.

• Cassandra can handle non-uniform hardware and disks. o This enables the mixing of solid state and spinning disks in a single cluster and pinning

tables to workload-appropriate disks.

• No downtime is required in Cassandra for upgrades or

adding/removing servers from the cluster.

Page 20: BigData Developers MeetUp

What ´s up with DataStax?

20

Page 21: BigData Developers MeetUp

DataStax at a glance

21

Founded in April 2010

~25 500+

Santa Clara, Austin, New York, London, Sydney

330+Employees Percent Customers

Page 22: BigData Developers MeetUp

DataStax delivers value

22

Certified,

Enterprise-ready

Cassandra

Security Analytics Search VisualMonitoring

ManagementServices

In-Memory

Dev. IDE & Drivers

ProfessionalServices

Support & Training

Commercial

Confidence

Enterprise

Functionality

Page 23: BigData Developers MeetUp

Enterprise Integrations

23

• DataStax adds Enterprise Features like: Hadoop, Solr,

Spark

Page 24: BigData Developers MeetUp

DataStax OpsCenter

24

• DataStax OpsCenter is a browser-based, visual management and

monitoring solution for Apache Cassandra and DataStax Enterprise

• Functionality is also exposed via HTTP APIs

Page 25: BigData Developers MeetUp

OpsCenter - New Cluster Example

25

A new, 10-node Cassandra (or Hadoop) cluster with OpsCenter running in 3 minutes… A new, 10-node DSE cluster with OpsCenter running on AWS in 3 minutes…

Done1 2 3

Page 26: BigData Developers MeetUp

OpsCenter 5.0

26

• Manage multiple clusters and nodes

• Add and remove nodes

• Administer individual nodes or in bulk

• Configure clusters

• Perform rolling restarts

• Automatically repair data

• Rebalance data

• Backup management

• Capacity planning

Page 27: BigData Developers MeetUp

DataStax Office Demo

27

• 32 Raspberry Pi´s

• 16 per DataStax Enterprise 4.5 Cluster

• Managed in OpsCenter 5.0

• “Red Button” downs one DataCenter

• Not the Performance-Demo but

• Availability

• Commodity Hardware

Page 28: BigData Developers MeetUp

Native Drivers

28

• Different Native Drivers available: Java, Python etc.

• Load Balancing Policies (Client Driver receives Updates)• Data Centre Aware

• Latency Aware

• Token Aware

• Reconnection policies

• Retry policies

• Downgrading Consistency

• Plus others..

• http://www.datastax.com/download/clientdrivers

Page 29: BigData Developers MeetUp

CQL

29

• Cassandra Query Language

• CQL is intended to provide a common, simpler and easier to use

interface into Cassandra - and you probably already know it!

• e.g. SELECT * FROM users

• Usual statements:

• CREATE / DROP / ALTER TABLE / SELECT

Page 30: BigData Developers MeetUp

CQL Basics

30

CREATE KEYSPACE league WITH REPLICATION = {‘class’:’NetworkTopologyStrategy’, ‘DataCentre1’:3, ‘DataCentre2’: 2};

USE league;

CREATE TABLE teams (

team_name varchar,

player_name varchar,

jersey int,

PRIMARY KEY (team_name, player_name)

);

SELECT * FROM teams WHERE team_name = ‘Mighty Mutts’ and player_name = ‘Lucky’;

INSERT INTO teams (team_name, player_name, jersey) VALUES ('Mighty Mutts',’Felix’,90);

Page 31: BigData Developers MeetUp

CQL Data Types

31

Page 32: BigData Developers MeetUp

DevCenter 1.1

32

• Visual Query Tool for Developers and Administrators

• Easily create and run Cassandra Queries

• Visually navigate database objects

• Context-based suggestions

Page 33: BigData Developers MeetUp

Do you remember the facts?

33

Page 34: BigData Developers MeetUp

Key Attributes of a cloud database

34

1. Transparent Elasticity

2. Transparent Scalability

3. High Availability

4. Easy Data Distribution

5. Data Redundancy

6. Support for multiple data types

7. Easy to manage

8. Lower Cost

9. Multiple Infrastructure Support

10. Security

Page 35: BigData Developers MeetUp

Transparent Elasticity

35

• Nodes can be added and removed from Cassandra online, with no

downtime being experienced.

1

2

3

4

5

6

1

7

104

2

3

5

6

8

9

11

12

Page 36: BigData Developers MeetUp

Transparent Scalability

36

• Addition of Cassandra nodes increases performance linearly and

ability to manage TB’s-PB’s of data.

1

2

3

4

5

6

1

7

104

2

3

5

6

8

9

11

12

Performance

throughput = N

Performance

throughput = N x 2

Page 37: BigData Developers MeetUp

High Availability

37

• Cassandra, with its peer-to-peer masterless architecture has no

single point of failure.

• 100% uptime is the norm!

Page 38: BigData Developers MeetUp

Easy Data Distribution

38

• Cassandra allows a single logical database to span 1-N Data

Centers that are geographically dispersed.

• Also supports a hybrid Cloud implementation.

Page 39: BigData Developers MeetUp

Data Redundancy

39

• Cassandra allows for customisable data redundancy so that data

is completely protected.

• Also supports rack/zone awareness (data can be replicated

between different racks to guard against machine/rack failures).

Page 40: BigData Developers MeetUp

Support for multiple data types

40

• Cassandra’s data model (based on Google’s Bigtable) allows a

user to store structured, semi-structured, and unstructured data

with ease.

ID Name SSN DOB

Portfolio Keyspace

Customer Table

Page 41: BigData Developers MeetUp

Easy to manage

41

• OpsCenter, AMI installers etc.. allow you to install and configure

an entire multi-node Cloud implementation in minutes.

• All can be managed and monitored via Web-based console or via

RESTful APIs.

Page 42: BigData Developers MeetUp

Lower Cost

42

• Cassandra is open source software and is freely available.

• Commercial/advanced versions of Cassandra are available from

DataStax along with support and additional services.

Page 43: BigData Developers MeetUp

Multiple Infrastructure Support

43

• Cassandra is supported on popular Cloud provider platforms and

operating systems.

Page 44: BigData Developers MeetUp

Security

44

• Cassandra has multiple security features - authentication,

authorisation, inflight encryption

• Advanced “enterprise level” security can also provided via

DataStax Enterprise

Page 45: BigData Developers MeetUp

What´s new?!

45

Page 46: BigData Developers MeetUp

What is Spark?

46

• Apache Project since 2010

• 10-100x faster than Hadoop MapReduce

• In-Memory Storage

• Single JVM Processor per node

• Rich Scala, Java and Python API´s

• 2x-5x less code

• Interactive Shell

Page 47: BigData Developers MeetUp

Why Spark on Cassandra?

47

• Data model independent queries

• cross-table operations (JOIN, UNION, etc.)!

• complex analytics (e.g. machine learning)

• data transformation, aggregation etc.

• stream processing (coming soon)

• all nodes are Spark workers

• by default resilient to worker failures

• first node promoted as Spark Master

• Standby Master promoted on failure

• Master HA available in Dactastax Enterprise

Page 48: BigData Developers MeetUp

How to Spark on Cassandra?

48

• DataStax Cassandra Spark Driver on GitHub

• Compatible with:

• Spark 0.9+

• Cassandra 2.0+

• DataStax Enterprise 4.5+

• Cassandra Tables exposes as Spark RDDs (Resilient Distributed

Datasets

• Read from and write to Cassandra

• Mapping of Cassandra tables and rows to Scala objects

• Scala only driver for now

Page 49: BigData Developers MeetUp

Spark type mapping

49

Page 50: BigData Developers MeetUp

2.1 Release - User Defined Types

50

CREATE TYPE address (

street text,

city text,

zip_code int,

phones set<text>

)

CREATE TABLE users (

id uuid PRIMARY KEY,

name text,

addresses map<text, address>

)

SELECT id, name, addresses.city, addresses.phones FROM users;

id | name | addresses.city | addresses.phones

--------------------+----------------+--------------------------

63bf691f | chris | Berlin | {’0201234567', ’0796622222'}

Page 51: BigData Developers MeetUp

2.1 Release - Secondary Indexes on

collections

51

CREATE TABLE songs (

id uuid PRIMARY KEY,

artist text,

album text,

title text,

data blob,

tags set<text>

);

CREATE INDEX song_tags_idx ON songs(tags);

SELECT * FROM songs WHERE tags CONTAINS 'blues';

id | album | artist | tags | title

----------+---------------+-------------------+-----------------------+------------------

5027b27e | Country Blues | Lightnin' Hopkins | {'acoustic', 'blues'} | Worrying My Mind

Page 52: BigData Developers MeetUp

Thanks! Let´s see a demo!

52