Spark with Cassandra
Christopher Batey - @chbatey
Freelance Software Engineer / DevOps / Architect
Loves:
- Breaking software
- Distributed systems
Hates:
- Fragile software
- Untested code :(
Introduction
Audience
Assumption is…
Overview
- Cassandra architecture
- Modelling time series - weather data for many stations
- What can be done with pure C*
- When to introduce Spark
What do we use Spark for?
- Batch processing
- Machine learning
- Ad-hoc querying of large datasets
- Stream processing
What do we use Cassandra for?
- Operational database
- OLTP
Cassandra overview
Master/slave
Diagram: a master node asynchronously replicates writes to its slaves.
Sharding
The other way
Consistent hashing
jim age: 36 car: ford gender: M
carol age: 37 car: bmw gender: F
johnny age: 12 gender: M
suzy age: 10 gender: F
Partition Key | Hash value
jim          | 350
carol        | 998
johnny       | 50
suzy         | 600
Ring diagram: partition key hashes (0-999) placed on a ring owned by nodes A, B, C and D.
Example

Node | Start range | End range | Primary key | Hash value
A    | 0           | 249       | johnny      | 50
B    | 250         | 499       | jim         | 350
C    | 500         | 749       | suzy        | 600
D    | 750         | 999       | carol       | 998
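The lookup in the table above can be sketched in a few lines of Scala. This is a toy model: the 0-999 token space and the hash values are the ones from the example, not a real hash (Cassandra uses the Murmur3 partitioner over a much larger range).

```scala
object ToyRing {
  // Node -> (start, end) of its primary token range, as in the example table
  val ranges = Map("A" -> (0, 249), "B" -> (250, 499),
                   "C" -> (500, 749), "D" -> (750, 999))

  // Example hash values from the slide, not a real hash function
  val hash = Map("jim" -> 350, "carol" -> 998, "johnny" -> 50, "suzy" -> 600)

  // Find the node whose token range contains the key's hash
  def nodeFor(key: String): String =
    ranges.collectFirst {
      case (node, (lo, hi)) if hash(key) >= lo && hash(key) <= hi => node
    }.get

  def main(args: Array[String]): Unit =
    // jim hashes to 350, which lands in B's range (250-499), and so on
    Seq("jim", "carol", "johnny", "suzy").foreach(k => println(s"$k -> ${nodeFor(k)}"))
}
```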
Fault tolerance
- Replicate each piece of data on multiple nodes
- Keep replicas on different racks
- Datacenter aware
Diagram: a client in DC1 writes at consistency level ONE; with RF = 3 in each datacenter the write is asynchronously replicated to DC2. We have replication!
Storing weather data

CREATE TABLE raw_weather_data (
  weather_station text,
  year int,
  month int,
  day int,
  hour int,
  temp double,
  PRIMARY KEY ((weather_station), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
Primary key relationship

PRIMARY KEY ((weatherstation_id), year, month, day, hour)
Data locality

weatherstation_id = '10010:99999' ?
Diagram: in a 1000-node cluster the partition key hash routes the query directly to the node that owns the partition ("You are here!").
Primary key relationship

PRIMARY KEY ((weatherstation_id), year, month, day, hour)

Partition key: weatherstation_id. Clustering columns: year, month, day, hour.

On disk, partition 10010:99999 stores one cell per clustering key, newest first:
2005:12:1:10:temp = -5.3 | 2005:12:1:9:temp = -4.9 | 2005:12:1:8:temp = -5.1 | 2005:12:1:7:temp = -5.6
Primary key relationship
I have a question!!
What happens if I want to do an ad-hoc query?
I’ve stored the data partitioned by weather id…
… now I want a report for all stations
I’ve stored the raw weather data…
… now I want rollups/aggregates
Analytics Workload Isolation
Deployment
- Spark worker on each of the Cassandra nodes
- Partitions made up of LOCAL Cassandra data
Diagram: every node runs a Spark worker (S) colocated with a Cassandra instance (C).
Cassandra Data is Distributed By Token Range

Diagram: the token ring (0-999) divided across Node 1, Node 2, Node 3 and Node 4. Without vnodes each node owns a single contiguous range; with vnodes each node owns many small ranges scattered around the ring.
Cassandra RDD
- Each Spark partition is made up of token ranges that live on the same node
- Each Spark partition is made up of Cassandra partitions that are on the same node
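Reading the table as a Cassandra RDD with the DataStax spark-cassandra-connector looks roughly like this; the keyspace name `isd_weather_data` and the connection host are assumptions, and the code needs a running Spark and Cassandra cluster:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: keyspace name and host are assumed values
val conf = new SparkConf()
  .setAppName("weather")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Each Spark partition of this RDD covers token ranges local to one node,
// so computation runs next to the data it reads
val weather = sc.cassandraTable("isd_weather_data", "raw_weather_data")
```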
Storing weather data

CREATE TABLE raw_weather_data (
  weather_station text,
  year int,
  month int,
  day int,
  hour int,
  temp double,
  PRIMARY KEY ((weather_station), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
Partition key = single node:
(count: 24, mean: 14.428150, stdev: 7.092196, max: 28.034969, min: 0.675863)

No partition key = every node:
(count: 11242, mean: 8.921956, stdev: 7.428311, max: 29.997986, min: -2.200000)
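Those two StatCounter outputs come from something like the following sketch (assuming the `sc` and keyspace from earlier): `where` pushes a partition-key predicate down so only one node is read, while the unrestricted version scans the whole table across every node.

```scala
import com.datastax.spark.connector._

// Restricted to one partition -> served by a single node
sc.cassandraTable("isd_weather_data", "raw_weather_data")
  .where("weather_station = ?", "010010:99999")
  .map(_.getDouble("temp"))
  .stats()

// No partition key restriction -> full scan across every node
sc.cassandraTable("isd_weather_data", "raw_weather_data")
  .map(_.getDouble("temp"))
  .stats()
```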
Not quick enough?
daily_aggregate_precip

CREATE TABLE daily_aggregate_precip (
  weather_station text,
  year int,
  month int,
  day int,
  precipitation counter,
  PRIMARY KEY ((weather_station), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);
SELECT precipitation FROM daily_aggregate_precip
WHERE weather_station = '010010:99999'
  AND year = 2005 AND month = 12
  AND day >= 1 AND day <= 7;
Weather station info
725030:14732,2008,01,01,00,5.0,-3.9,1020.4,270,4.6,2,0.0
Creating a Stream
Saving the raw data
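A minimal sketch of both steps with Spark Streaming and the connector's streaming support; the Kafka topic, ZooKeeper address, case class and CSV field positions are all assumptions based on the sample line above:

```scala
import com.datastax.spark.connector.streaming._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Field names mirror the raw_weather_data columns
case class RawWeather(weatherStation: String, year: Int, month: Int,
                      day: Int, hour: Int, temp: Double)

val ssc = new StreamingContext(conf, Seconds(5))

// Creating a stream from a (hypothetical) Kafka topic of CSV lines
val lines = KafkaUtils.createStream(ssc, "zookeeper:2181", "weather-group",
  Map("raw_weather" -> 1)).map(_._2)

// Saving the raw data: parse each line and write it straight to Cassandra
lines.map { line =>
  val f = line.split(",")
  RawWeather(f(0), f(1).toInt, f(2).toInt, f(3).toInt, f(4).toInt, f(5).toDouble)
}.saveToCassandra("isd_weather_data", "raw_weather_data")

ssc.start()
```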
Building an aggregate

CREATE TABLE daily_aggregate_precip (
  weather_station text,
  year int,
  month int,
  day int,
  precipitation counter,
  PRIMARY KEY ((weather_station), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);
CQL Counter
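Because precipitation is a CQL counter column, each value the connector writes is applied as an increment, so re-running the aggregation adds to the running total rather than overwriting it. A sketch, assuming a `parsedStream` of the parsed weather records and a hypothetical `oneHourPrecip` field:

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

// Sum precipitation per (station, year, month, day) for each batch;
// writing to the counter column increments the stored daily total
parsedStream
  .map(w => ((w.weatherStation, w.year, w.month, w.day), w.oneHourPrecip))
  .reduceByKey(_ + _)
  .map { case ((station, y, m, d), precip) => (station, y, m, d, precip) }
  .saveToCassandra("isd_weather_data", "daily_aggregate_precip",
    SomeColumns("weather_station", "year", "month", "day", "precipitation"))
```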
Want more Spark/C* goodness?
@helenaedelson
Conclusion
- Cassandra = OLTP database for the large scale
- Spark can be used to do complex queries within a partition
- Or analytical queries across an entire table
- Spark Streaming to keep tables up to date
Thanks for listening
Questions later? @chbatey