Spark with Cassandra by Christopher Batey


Christopher Batey - @chbatey

Freelance software engineer/DevOps/architect

Loves:

Breaking software

Distributed systems

Hates:

Fragile software

Untested code :(

Introduction

Audience

Assumption is…

Overview

Cassandra architecture

Modelling time series - weather data for many stations

What can be done with pure C*

When to introduce Spark

What do we use Spark for?

Batch processing

Machine Learning

Ad-hoc querying of large datasets

Streaming processing

What do we use Cassandra for?

Operational Database

OLTP

Cassandra overview

Master slave

Master

Async replication

Slave

Sharding

The other way

Consistent hashing

jim age: 36 car: ford gender: M

carol age: 37 car: bmw gender: F

johnny age: 12 gender: M

suzy age: 10 gender: F

Partition Key   Hash value
jim             350
carol           998
johnny          50
suzy            600

[Diagram: token ring from 0 to 999 split into four ranges owned by nodes A, B, C and D, with range boundaries at 0, 250, 500 and 750]

Example

Node   Start range   End range   Primary key   Hash value
A      0             249         johnny        50
B      250           499         jim           350
C      500           749         suzy          600
D      750           999         carol         998
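
The example table can be read as a lookup from hash value to owning node. A minimal Python sketch of that lookup follows. The hash values are the toy numbers from the slides rather than real hashes, and the names `RANGE_STARTS`, `NODES` and `owner` are made up for illustration.

```python
from bisect import bisect_right

# Token ranges from the example table: each node owns [start, end] on a 0-999 ring.
RANGE_STARTS = [0, 250, 500, 750]
NODES = ["A", "B", "C", "D"]

# Toy hash values copied from the slides; a real cluster derives the token
# from the partition key with a hash function such as Murmur3.
HASHES = {"jim": 350, "carol": 998, "johnny": 50, "suzy": 600}


def owner(token):
    """Return the node whose range contains the token."""
    if not 0 <= token <= 999:
        raise ValueError("token outside the 0-999 ring")
    # bisect_right finds the first range start greater than the token,
    # so the owning range is the one just before it.
    return NODES[bisect_right(RANGE_STARTS, token) - 1]


placement = {key: owner(token) for key, token in HASHES.items()}
print(placement)
```

Because only the range boundaries decide ownership, adding a node means splitting one range rather than rehashing every key.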

Fault tolerance

Replicate each piece of data on multiple nodes

Keep replicas on different racks

Datacenter aware

[Diagram: a client writes at CL = 1 to a cluster spanning DC1 and DC2, each with RF 3]

We have replication!
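
To make replication concrete, here is a deliberately simplified Python sketch: the primary owner plus the next nodes clockwise on the ring hold the copies, and a write at a given consistency level succeeds once that many replicas acknowledge it. It ignores racks and datacenters, which the slides say real placement takes into account. The four-node ring and the function names are assumptions for illustration.

```python
RING = ["A", "B", "C", "D"]  # nodes in ring order, as in the consistent hashing example


def replicas(primary, rf):
    """Primary owner plus the next rf - 1 nodes clockwise on the ring."""
    if not 1 <= rf <= len(RING):
        raise ValueError("rf must be between 1 and the number of nodes")
    start = RING.index(primary)
    return [RING[(start + i) % len(RING)] for i in range(rf)]


def write_succeeds(acks, consistency_level):
    """The client gets a success once enough replicas have acknowledged."""
    return acks >= consistency_level


print(replicas("B", 3))
print(replicas("D", 3))  # wraps around the end of the ring
print(write_succeeds(acks=1, consistency_level=1))
```

With RF 3 and CL = 1, a write still succeeds when two of the three replicas are down, which is the trade the slide is pointing at.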

Storing weather data

CREATE TABLE raw_weather_data (
    weather_station text,
    year int,
    month int,
    day int,
    hour int,
    temp double,
    PRIMARY KEY ((weather_station), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

Primary key relationship

PRIMARY KEY ((weatherstation_id),year,month,day,hour)

Data Locality

weatherstation_id='10010:99999' ?

1000 Node Cluster

You are here!

Primary key relationship

PRIMARY KEY ((weatherstation_id),year,month,day,hour)

Partition Key: weatherstation_id
Clustering Columns: year, month, day, hour

[Diagram: partition 10010:99999 stored as one wide row with the cells 2005:12:1:10:temp, 2005:12:1:9:temp, 2005:12:1:8:temp and 2005:12:1:7:temp; the temp values shown on the slides are -5.6, -5.1, -5.3 and -4.9]
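
The relationship between partition key and clustering columns can be sketched in Python as a dictionary of partitions, each read back sorted by its clustering key in descending order. This is only a model of the layout: Cassandra keeps rows sorted on disk, whereas the sketch sorts on read. The temperatures below are made-up illustrative values, and the function names are hypothetical.

```python
from collections import defaultdict

# partition key -> {clustering key (year, month, day, hour): temp}
table = defaultdict(dict)


def insert(station, year, month, day, hour, temp):
    # Writing the same primary key again overwrites the value (an upsert).
    table[station][(year, month, day, hour)] = temp


def read_partition(station):
    # CLUSTERING ORDER BY (... DESC): newest readings come first.
    return sorted(table[station].items(), reverse=True)


def hours_between(station, year, month, day, first_hour, last_hour):
    # A range on the clustering columns is a contiguous slice of one partition.
    low, high = (year, month, day, first_hour), (year, month, day, last_hour)
    return [(key, temp) for key, temp in read_partition(station) if low <= key <= high]


# Made-up temperatures, inserted out of order on purpose.
insert("10010:99999", 2005, 12, 1, 7, -1.0)
insert("10010:99999", 2005, 12, 1, 10, -4.0)
insert("10010:99999", 2005, 12, 1, 8, -2.0)
insert("10010:99999", 2005, 12, 1, 9, -3.0)

for clustering_key, temp in read_partition("10010:99999"):
    print(clustering_key, temp)
```

Every reading for a station lives in one partition, so a query for a station and a time range touches a single node.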

I have a question!!

What happens if I want to do an ad-hoc query??

I’ve stored the data partitioned by weather station id…

… now I want a report for all stations

I’ve stored the raw weather data…

… now I want rollups/aggregates

Analytics Workload Isolation

Deployment

- Spark worker on each of the Cassandra nodes

- Partitions made up of LOCAL Cassandra data

[Diagram: four nodes, each running a Spark worker (S) next to Cassandra (C)]

Cassandra Data is Distributed By Token Range

[Diagram: token ring with ticks at 0 and 500, divided among Node 1, Node 2, Node 3 and Node 4]

[Diagram: the same ring, without vnodes]

[Diagram: the same ring, with vnodes]

Cassandra RDD

Each Spark partition is made up of token ranges that live on the same node

Each Spark partition is made up of Cassandra partitions that are on the same node
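
The idea of building Spark partitions out of node-local token ranges can be sketched in Python as follows. The token layout is hypothetical (two ranges per node, as a small-scale picture of vnodes), and the grouping logic is an illustration only: how the real connector sizes its splits is not covered in these slides.

```python
from collections import defaultdict

# (start token, end token, owning node). With vnodes a node owns several
# non-adjacent ranges. This layout is hypothetical.
TOKEN_RANGES = [
    (0, 124, "Node 1"), (125, 249, "Node 3"),
    (250, 374, "Node 2"), (375, 499, "Node 4"),
    (500, 624, "Node 1"), (625, 749, "Node 2"),
    (750, 874, "Node 3"), (875, 999, "Node 4"),
]


def spark_partitions(token_ranges, ranges_per_partition=2):
    """Group token ranges by owner, then chunk each group into partitions."""
    by_node = defaultdict(list)
    for start, end, node in token_ranges:
        by_node[node].append((start, end))
    partitions = []
    for node, ranges in sorted(by_node.items()):
        for i in range(0, len(ranges), ranges_per_partition):
            partitions.append({
                "preferred_node": node,  # where the Spark task should run
                "token_ranges": ranges[i:i + ranges_per_partition],
            })
    return partitions


for partition in spark_partitions(TOKEN_RANGES):
    print(partition)
```

Since every range in a partition has the same owner, a task scheduled on that node can read all of its data locally.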

Storing weather data

CREATE TABLE raw_weather_data (
    weather_station text,
    year int,
    month int,
    day int,
    hour int,
    temp double,
    PRIMARY KEY ((weather_station), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

(count: 24, mean: 14.428150, stdev: 7.092196, max: 28.034969, min: 0.675863)

Partition key = Single node

(count: 11242, mean: 8.921956, stdev: 7.428311, max: 29.997986, min: -2.200000)

No partition key = Every node
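
The two summaries above have the shape count, mean, stdev, max, min. As a rough sketch of how such a summary can be computed in a single pass, the Python below uses Welford's running update and reports the population standard deviation. Whether the slides' figures use population or sample deviation is an assumption here, and the readings are made up.

```python
import math


def stats(values):
    """Count, mean, population standard deviation, max and min in one pass."""
    count, mean, m2 = 0, 0.0, 0.0
    highest, lowest = -math.inf, math.inf
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # Welford's running sum of squared deviations
        highest, lowest = max(highest, x), min(lowest, x)
    stdev = math.sqrt(m2 / count) if count else float("nan")
    return {"count": count, "mean": mean, "stdev": stdev, "max": highest, "min": lowest}


# Made-up readings keyed by weather station.
readings = {
    "station-1": [1.0, 2.0, 3.0, 4.0],
    "station-2": [10.0, 12.0],
}

# With a partition key: only one station's data is read.
print(stats(readings["station-1"]))

# Without a partition key: every station's data is read.
print(stats(t for temps in readings.values() for t in temps))
```

The computation is identical in both cases. What changes is how much data has to be scanned and how many nodes take part.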

Not quick enough?

daily_aggregate_precip

CREATE TABLE daily_aggregate_precip (
    weather_station text,
    year int,
    month int,
    day int,
    precipitation counter,
    PRIMARY KEY ((weather_station), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);

SELECT precipitation FROM daily_aggregate_precip
WHERE weather_station='010010:99999' AND year=2005 AND month=12 AND day>=1 AND day<=7;

Weather station info

725030:14732,2008,01,01,00,5.0,-3.9,1020.4,270,4.6,2,0.0
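
A raw line like the one above has to be split into typed fields before it can be saved. The Python sketch below parses the first six fields, assuming they follow the column order of raw_weather_data (station, year, month, day, hour, temp). The slides do not name the remaining fields, so they are kept as unparsed strings. `RawWeather` and `parse_line` are hypothetical names.

```python
from typing import NamedTuple, Tuple


class RawWeather(NamedTuple):
    weather_station: str
    year: int
    month: int
    day: int
    hour: int
    temp: float
    extra: Tuple[str, ...]  # remaining fields, meaning not given in the slides


def parse_line(line):
    fields = line.strip().split(",")
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(fields)}")
    station, year, month, day, hour, temp = fields[:6]
    # Assumption: the sixth field is the temperature, matching the table schema.
    return RawWeather(station, int(year), int(month), int(day), int(hour),
                      float(temp), tuple(fields[6:]))


record = parse_line("725030:14732,2008,01,01,00,5.0,-3.9,1020.4,270,4.6,2,0.0")
print(record)
```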

Creating a Stream

Saving the raw data

Building an aggregate

CREATE TABLE daily_aggregate_precip (
    weather_station text,
    year int,
    month int,
    day int,
    precipitation counter,
    PRIMARY KEY ((weather_station), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);

CQL Counter
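
A counter column is only ever incremented, which fits a stream of readings that each add to a daily total. The Python sketch below models that with `collections.Counter`: one running total per (station, year, month, day), plus a lookup over a range of days like the SELECT shown earlier. CQL counters hold integers, so the sketch assumes precipitation arrives in whole units. That unit, the sample amounts and the function names are all assumptions.

```python
from collections import Counter

# (weather_station, year, month, day) -> running total, standing in for
# the precipitation counter column.
daily_precip = Counter()


def record_precip(station, year, month, day, amount):
    # In spirit: UPDATE ... SET precipitation = precipitation + amount
    daily_precip[(station, year, month, day)] += amount


def precip_for_days(station, year, month, first_day, last_day):
    """Daily totals for a range of days, skipping days with no readings."""
    return {
        day: daily_precip[(station, year, month, day)]
        for day in range(first_day, last_day + 1)
        if (station, year, month, day) in daily_precip
    }


# Made-up hourly amounts arriving from the stream.
record_precip("010010:99999", 2005, 12, 1, 3)
record_precip("010010:99999", 2005, 12, 1, 2)
record_precip("010010:99999", 2005, 12, 3, 4)
record_precip("010010:99999", 2005, 12, 9, 1)

print(precip_for_days("010010:99999", 2005, 12, 1, 7))
```

Reading the pre-aggregated totals is a single-partition query, so it stays fast no matter how many raw readings went into them.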

Want more Spark/C* goodness?

@helenaedelson

Conclusion

Cassandra = OLTP database for large scale

Spark can be used to do complex queries in a partition

Or analytical queries for an entire table

Spark streaming to keep tables up to date

Thanks for listening

Questions later? @chbatey
