Spark with Cassandra by Christopher Batey



Christopher Batey - @chbatey

Freelance software engineer/DevOps/architect

Loves:

Breaking software

Distributed systems

Hates:

Fragile software

Untested code :(

Introduction


Audience


Assumption is…

Overview

Cassandra architecture

Modelling time series - weather data for many stations

What can be done with pure C*

When to introduce Spark


What do we use Spark for?

Batch processing

Machine Learning

Ad-hoc querying of large datasets

Streaming processing


What do we use Cassandra for?

Operational Database

OLTP

Cassandra overview


Master slave

[Figure: a master node replicating asynchronously to a slave]


Sharding


The other way


Consistent hashing

Rows, keyed by partition key:

jim     age: 36  car: ford  gender: M
carol   age: 37  car: bmw   gender: F
johnny  age: 12             gender: M
suzy    age: 10             gender: F

Partition Key   Hash value
jim             350
carol           998
johnny          50
suzy            600

[Figure: two token rings shared by nodes A, B, C and D; labelled tokens include 0, 49, 50, 249, 250, 749, 750 and 999]

Example

Node  Start range  End range  Primary key  Hash value
A     0            249        johnny       50
B     250          499        jim          350
C     500          749        suzy         600
D     750          999        carol        998
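A minimal sketch of the lookup in the example, assuming the toy 0 to 999 ring and the four ranges listed for nodes A to D. The hash values are copied from the slides rather than computed; `owner_of` and the constant names are illustrative, not Cassandra APIs.

```python
from bisect import bisect_right

# Start of each node's token range, taken from the example
# (A: 0-249, B: 250-499, C: 500-749, D: 750-999).
RANGE_STARTS = [0, 250, 500, 750]
NODES = ["A", "B", "C", "D"]

# Hash values copied from the slides; a real cluster computes them with
# its partitioner instead of looking them up.
SLIDE_TOKENS = {"jim": 350, "carol": 998, "johnny": 50, "suzy": 600}


def owner_of(token: int) -> str:
    """Return the node whose range contains the token (toy ring, 0..999)."""
    if not 0 <= token <= 999:
        raise ValueError("token outside the toy ring")
    # bisect_right finds the last range start that is <= token.
    return NODES[bisect_right(RANGE_STARTS, token) - 1]


placement = {key: owner_of(tok) for key, tok in SLIDE_TOKENS.items()}
print(placement)
```

Because the owner follows from the token alone, any node or client that knows the ring can route a request without a central lookup.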


Fault tolerance

Replicate each piece of data on multiple nodes

Keep replicas on different racks

Datacenter aware
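A simplified sketch of replica placement, assuming the four-node ring A to D from the earlier example: the primary owner plus the next nodes clockwise until the replication factor is met. It deliberately ignores racks and data centers, so it illustrates the idea only; `replicas` and `RING` are hypothetical names.

```python
from typing import List

RING = ["A", "B", "C", "D"]  # nodes in ring order, as in the earlier example


def replicas(primary: str, rf: int) -> List[str]:
    """Primary node plus the next rf-1 nodes clockwise around the ring."""
    if not 1 <= rf <= len(RING):
        raise ValueError("rf must be between 1 and the number of nodes")
    start = RING.index(primary)
    # Wrap around the end of the list to close the ring.
    return [RING[(start + i) % len(RING)] for i in range(rf)]


print(replicas("B", 3))
print(replicas("D", 3))
```

A rack-aware or datacenter-aware strategy would skip candidates that share a rack with an already chosen replica, and would apply a replication factor per data center.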

[Figure: a client issuing a WRITE at CL = 1 to a cluster spanning DC1 and DC2, each with RF3. Caption: "We have replication!"]

Storing weather data

CREATE TABLE raw_weather_data (
    weather_station text,
    year int,
    month int,
    day int,
    hour int,
    temp double,
    PRIMARY KEY ((weather_station), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

Primary key relationship

PRIMARY KEY ((weatherstation_id),year,month,day,hour)

Data Locality

weatherstation_id=‘10010:99999’ ?

1000 Node Cluster

You are here!

Primary key relationship

PRIMARY KEY ((weatherstation_id),year,month,day,hour)

Partition Key: weatherstation_id (for example 10010:99999)
Clustering Columns: year, month, day, hour

Primary key relationship

PRIMARY KEY ((weatherstation_id),year,month,day,hour)

[Figure: partition 10010:99999 laid out as one wide row. Cells, in clustering order: 2005:12:1:10:temp, 2005:12:1:9:temp, 2005:12:1:8:temp, 2005:12:1:7:temp. Temperatures shown on the slide: -5.6, -5.1, -5.3, -4.9.]
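A rough in-memory model of this layout, assuming one sorted list of rows per partition key, kept newest first to mirror the DESC clustering order. The temperature readings below are made up for illustration; `insert` and `latest` are hypothetical helpers, not driver calls.

```python
from collections import defaultdict

# partition key -> list of ((year, month, day, hour), temp), newest first
table = defaultdict(list)


def insert(station, year, month, day, hour, temp):
    rows = table[station]
    rows.append(((year, month, day, hour), temp))
    # Keep newest first, mirroring CLUSTERING ORDER BY (... DESC).
    rows.sort(key=lambda row: row[0], reverse=True)


def latest(station, n):
    """First n rows of one partition: one node, one sequential read."""
    return table[station][:n]


# Made-up readings, inserted out of order on purpose.
insert("10010:99999", 2005, 12, 1, 8, -1.0)
insert("10010:99999", 2005, 12, 1, 10, -3.0)
insert("10010:99999", 2005, 12, 1, 7, -0.5)
insert("10010:99999", 2005, 12, 1, 9, -2.0)

print(latest("10010:99999", 2))
```

The partition key decides which node holds the list; the clustering columns decide the order inside it, which is why "latest readings for one station" is cheap.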


I have a question!!

What happens if I want to do an ad-hoc query??

I’ve stored the data partitioned by weather station…

… now I want a report for all stations


I’ve stored the raw weather data…

… now I want rollups/aggregates


Analytics Workload Isolation


Deployment

- Spark worker on each of the Cassandra nodes

- Partitions made up of LOCAL cassandra data

[Figure: four nodes, each pairing a Spark worker (S) with a Cassandra node (C)]

Cassandra Data is Distributed By Token Range

[Figure: token ring (ticks at 0 and 500) shared by Node 1, Node 2, Node 3 and Node 4, shown first without vnodes and then with vnodes]


Cassandra RDD

Each Spark partition is made up of token ranges that live on the same node

Each Spark partition is made up of Cassandra partitions that are on the same node
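A sketch of that grouping, assuming a hypothetical vnode layout in which each node owns two non-adjacent token ranges. `plan_partitions` is an illustrative helper, not the connector's API, and it ignores how much data each range holds.

```python
from collections import defaultdict

# Hypothetical vnode layout: (start_token, end_token, owning_node).
TOKEN_RANGES = [
    (0, 124, "Node 1"), (125, 249, "Node 3"),
    (250, 374, "Node 2"), (375, 499, "Node 4"),
    (500, 624, "Node 1"), (625, 749, "Node 2"),
    (750, 874, "Node 3"), (875, 999, "Node 4"),
]


def plan_partitions(token_ranges):
    """Group token ranges by owner so each partition reads from one node."""
    by_node = defaultdict(list)
    for start, end, node in token_ranges:
        by_node[node].append((start, end))
    return dict(by_node)


for node, ranges in sorted(plan_partitions(TOKEN_RANGES).items()):
    print(node, ranges)
```

With a Spark worker on every Cassandra node, each group can be scheduled on the machine that already holds its data.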

Storing weather data

CREATE TABLE raw_weather_data (
    weather_station text,
    year int,
    month int,
    day int,
    hour int,
    temp double,
    PRIMARY KEY ((weather_station), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

(count: 24, mean: 14.428150, stdev: 7.092196, max: 28.034969, min: 0.675863)

Partition key = single node

(count: 11242, mean: 8.921956, stdev: 7.428311, max: 29.997986, min: -2.200000)

No partition key = every node
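The two result lines share one shape: count, mean, stdev, max, min. A small sketch that produces the same shape from a list of temperatures, assuming the stdev is the population standard deviation (the slides do not say which); `summarize` is a hypothetical helper and the input values are placeholders.

```python
import statistics


def summarize(temps):
    """Format count/mean/stdev/max/min in the same shape as the slides."""
    return "(count: %d, mean: %f, stdev: %f, max: %f, min: %f)" % (
        len(temps),
        statistics.mean(temps),
        statistics.pstdev(temps),  # population stdev: an assumption
        max(temps),
        min(temps),
    )


print(summarize([1.0, 2.0, 3.0, 4.0]))
```

The computation is identical in both cases; what changes is how much data has to be read, one partition on one node versus the whole table on every node.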


Not quick enough?

daily_aggregate_precip

CREATE TABLE daily_aggregate_precip (
    weather_station text,
    year int,
    month int,
    day int,
    precipitation counter,
    PRIMARY KEY ((weather_station), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);

SELECT precipitation
FROM daily_aggregate_precip
WHERE weather_station='010010:99999'
  AND year=2005 AND month=12
  AND day>=1 AND day <= 7;


Weather station info


725030:14732,2008,01,01,00,5.0,-3.9,1020.4,270,4.6,2,0.0
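A sketch of parsing one such line, assuming the first six fields line up with the raw_weather_data columns (weather_station, year, month, day, hour, temp). The slides do not name the remaining fields, so they are kept as an unnamed tuple; `RawWeather` and `parse_line` are hypothetical names.

```python
from typing import NamedTuple


class RawWeather(NamedTuple):
    weather_station: str
    year: int
    month: int
    day: int
    hour: int
    temp: float    # assumed to be the sixth field
    extra: tuple   # remaining readings, left unnamed


def parse_line(line: str) -> RawWeather:
    fields = line.strip().split(",")
    if len(fields) < 6:
        raise ValueError("expected at least 6 comma-separated fields")
    station = fields[0]
    # int() accepts zero-padded values such as "01" and "00".
    year, month, day, hour = (int(f) for f in fields[1:5])
    return RawWeather(station, year, month, day, hour, float(fields[5]),
                      tuple(float(f) for f in fields[6:]))


row = parse_line("725030:14732,2008,01,01,00,5.0,-3.9,1020.4,270,4.6,2,0.0")
print(row)
```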


Creating a Stream


Saving the raw data

Building an aggregate

CREATE TABLE daily_aggregate_precip (
    weather_station text,
    year int,
    month int,
    day int,
    precipitation counter,
    PRIMARY KEY ((weather_station), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);

CQL Counter
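An in-memory sketch of the counter pattern, assuming each incoming reading is applied as an increment to a per-day total. CQL counters hold 64-bit integers, so the sketch assumes precipitation has already been scaled to an integer unit; the unit, the sample amounts, and the helper names are all hypothetical.

```python
from collections import Counter

# (weather_station, year, month, day) -> running total, in an integer unit.
daily_precip = Counter()


def record(station, year, month, day, amount):
    """Apply one increment, like SET precipitation = precipitation + ?"""
    daily_precip[(station, year, month, day)] += amount


def total(station, year, month, first_day, last_day):
    """Sum the daily counters for a day range within one month."""
    return sum(daily_precip[(station, year, month, d)]
               for d in range(first_day, last_day + 1))


# Made-up increments for illustration.
record("010010:99999", 2005, 12, 1, 3)
record("010010:99999", 2005, 12, 1, 4)
record("010010:99999", 2005, 12, 2, 5)
record("010010:99999", 2005, 12, 9, 7)

print(total("010010:99999", 2005, 12, 1, 7))
```

Because the totals are maintained at write time, the first-week query reads at most seven small rows from one partition instead of scanning raw readings.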


Want more Spark/C* goodness?

@helenaedelson

Conclusion

Cassandra = OLTP database for the large scale

Spark can be used to do complex queries in a partition

Or analytical queries for an entire table

Spark streaming to keep tables up to date

Thanks for listening

Questions later? @chbatey