Cassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra’s sweet spot
Dave Gardner @davegardnerisme
jobs.hailocab.com
Looking for an expert backend Java dev – speak to me!
meetup.com/Cassandra-London
Next event 21st November
Building applications with Cassandra
• Key features
• Creating an application
• Data modeling
Comparing Cassandra with X
“Can someone quickly explain the differences between the two? Other than the fact that MongoDB supports ad-hoc querying I don't know whats different. It also appears (using google trends) that MongoDB seems to be growing while Cassandra is dying off. Is this the case?”
27th July 2010, http://comments.gmane.org/gmane.comp.db.cassandra.user/7773
Comparing Cassandra with X
“They have approximately nothing in common. And, no, Cassandra is definitely not dying off.”
28th July 2010, http://comments.gmane.org/gmane.comp.db.cassandra.user/7773
Top Tip #1
To use a NoSQL solution effectively, we need to identify its sweet spot.
This means learning about each solution: how is it designed? What algorithms does it use?
http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html
Comparing Cassandra with X
“they say … I can’t decide between this project and this project even though they look nothing like each other. And the fact that you can’t decide indicates that you don’t actually have a problem that requires them.”
Benjamin Black – NoSQL Tapes (at 30:15)
http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip
Headline features
1. Elastic
Read and write throughput increases linearly as new machines are added
http://cassandra.apache.org/
Headline features
2. Decentralised
Fault tolerant with no single point of failure; no “master” node
http://cassandra.apache.org/
The dynamo paper
• Consistent hashing
• Vector clocks
• Gossip protocol
• Hinted handoff
• Read repair
http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
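The first idea on that list can be sketched in a few lines. This is a minimal, illustrative consistent-hashing ring in Python, not Cassandra's actual implementation; node names and the row key are made up, and Cassandra's RandomPartitioner similarly hashes keys (with MD5) onto a ring where each key is stored on the first RF nodes found walking clockwise.

```python
import hashlib
from bisect import bisect_right

def ring_position(token):
    # Hash a string onto the ring (MD5, as in Cassandra's RandomPartitioner).
    return int(hashlib.md5(token.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, rf=3):
        self.rf = rf
        # Each node owns a position on the ring.
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def replicas(self, key):
        # Walk clockwise from the key's position, collecting RF distinct nodes.
        positions = [p for p, _ in self.ring]
        start = bisect_right(positions, ring_position(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]

ring = Ring(["node1", "node2", "node3", "node4", "node5", "node6"], rf=3)
print(ring.replicas("some-row-key"))  # three distinct nodes, always the same three
```

Adding a node only moves the keys between it and its predecessor, which is what makes the cluster elastic.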
The dynamo paper
(Diagram: with RF = 3, a client sends its request to a coordinator node, which forwards it to the three replica nodes on the ring, nodes #1–#6.)
Headline features
3. Rich data model
Column based, range slices, column slices, secondary indexes, counters, expiring columns
http://cassandra.apache.org/
The big table paper
• Sparse "columnar" data model
• SSTable disk storage
• Append-only commit log
• Memtable (buffer and sort)
• Immutable SSTable files
• Compaction
http://labs.google.com/papers/bigtable-osdi06.pdf
http://www.slideshare.net/geminimobile/bigtable-4820829
The big table paper
(Diagram: a Column Family maps a Row Key to a set of Columns; each Column is a Name/Value pair.)
Headline features
4. You're in control
Tunable consistency, per operation
http://cassandra.apache.org/
Consistency levels
How many replicas must respond to declare success?
Consistency levels: write operations
Level Description
ANY One node, including hinted handoff
ONE One node
QUORUM N/2 + 1 replicas
LOCAL_QUORUM N/2 + 1 replicas in local data centre
EACH_QUORUM N/2 + 1 replicas in each data centre
ALL All replicas
http://wiki.apache.org/cassandra/API#Write
Consistency levels: read operations
Level Description
ONE 1st Response
QUORUM N/2 + 1 replicas
LOCAL_QUORUM N/2 + 1 replicas in local data centre
EACH_QUORUM N/2 + 1 replicas in each data centre
ALL All replicas
http://wiki.apache.org/cassandra/API#Read
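The QUORUM arithmetic is worth spelling out. A sketch of the rule, where N is the replication factor: whenever the write count W plus the read count R exceeds N, the read set and write set must overlap in at least one replica, so a QUORUM read always sees the latest QUORUM write.

```python
def quorum(replication_factor):
    # QUORUM = N/2 + 1, using integer division.
    return replication_factor // 2 + 1

for n in (1, 2, 3, 5):
    w = r = quorum(n)
    # W + R > N guarantees the read touches at least one up-to-date replica.
    print(f"N={n}: W={w}, R={r}, overlap guaranteed: {w + r > n}")
```

With RF = 3, QUORUM is 2, so QUORUM writes plus QUORUM reads give strong consistency while tolerating one replica being down.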
Headline features
5. Performant
Well known for high write performance
http://www.datastax.com/docs/1.0/introduction/index#core-strengths-of-cassandra
Benchmark*
http://blog.cubrid.org/dev-platform/nosql-benchmarking/
* Add pinch of salt
Recap: headline features
1. Elastic
2. Decentralised
3. Rich data model
4. You’re in control (tunable consistency)
5. Performant
A simple ad-targeting application
Some ads
Our user knowledge
Choose which ad to show
A simple ad-targeting application
Allow us to capture user behaviour/data via “pixels” - placing users into segments (different buckets)
http://pixel.wehaveyourkidneys.com/add.php?add=foo
A simple ad-targeting application
Record clicks and impressions of each ad; storing data per-ad and per-segment
http://pixel.wehaveyourkidneys.com/adImpression.php?ad=1
http://pixel.wehaveyourkidneys.com/adClick.php?ad=1
A simple ad-targeting application
Real-time ad performance analytics, broken down by segment (which segments are performing well?)
http://www.wehaveyourkidneys.com/adPerformance.php?ad=1
A simple ad-targeting application
Recommendations based on best-performing ads
(this is left as an exercise for the reader)
Additional requirements
• Large number of users
• High volume of impressions
• Highly available – downtime is money
A good fit for Cassandra?
Yes!
Big data, high availability and lots of writes are all good signs that Cassandra will fit well.
http://www.nosqldatabases.com/main/2010/10/19/what-is-cassandra-good-for.html
A good fit for Cassandra?
Although there are many things that people are using Cassandra for.
Highly available HTTP request routing (tiny data!)
http://blip.tv/datastax/highly-available-http-request-routing-dns-using-cassandra-5501901
Top Tip #2
Cassandra is an excellent fit where availability matters, where there is a lot of data or where you have a large number of write operations.
Demo
Live demo before we start
Data modeling
Start from your queries, work backwards
http://www.slideshare.net/mattdennis/cassandra-data-modeling
http://blip.tv/datastax/data-modeling-workshop-5496906
Data model basics: conflict resolution
Per-column timestamp-based conflict resolution
http://cassandra.apache.org/
{ column: foo, value: bar, timestamp: 1000}
{ column: foo, value: zing, timestamp: 1001}
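The rule can be captured in one function. This is a sketch of last-write-wins resolution: when two versions of the same column meet (for example during read repair), the one with the higher timestamp survives.

```python
def resolve(a, b):
    # a and b are versions of the same column:
    # dicts like {"column": ..., "value": ..., "timestamp": ...}.
    # The version with the higher timestamp wins.
    return a if a["timestamp"] >= b["timestamp"] else b

v1 = {"column": "foo", "value": "bar", "timestamp": 1000}
v2 = {"column": "foo", "value": "zing", "timestamp": 1001}
print(resolve(v1, v2)["value"])  # zing
```

Note the granularity: resolution is per column, so concurrent writes to different columns of the same row never conflict with each other.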
Data model basics: column ordering
Columns ordered at time of writing, according to Column Family schema
http://cassandra.apache.org/
{ column: zebra, value: foo, timestamp: 1000}
{ column: badger, value: foo, timestamp: 1001}
Data model basics: column ordering
Columns ordered at time of writing, according to Column Family schema
http://cassandra.apache.org/
{ badger: foo, zebra: foo}
with AsciiType column schema
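A sketch of the effect above: reads return columns sorted by column name according to the Column Family's comparator (plain ASCII/byte ordering here), regardless of the order in which they were written.

```python
row = {}
row["zebra"] = "foo"   # written first
row["badger"] = "foo"  # written second

# A read returns columns in comparator order, not insertion order:
print(sorted(row))  # ['badger', 'zebra']
```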
Data modeling: user segments
Add user to bucket X, with expiry time Y
Which buckets is user X in?
["user"][<uuid>][<bucketId>] = 1
[CF] [rowKey] [columnName] = value
Data modeling: user segments
user Column Family:
[f97be9cc-5255-4578-8813-76701c0945bd][bar] = 1
[f97be9cc-5255-4578-8813-76701c0945bd][foo] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][baz] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][zoo] = 1
[503778bc-246f-4041-ac5a-fd944176b26d][aaa] = 1
Q: Is user in segment X?
A: Single column fetch
Data modeling: user segments
user Column Family:
[f97be9cc-5255-4578-8813-76701c0945bd][bar] = 1
[f97be9cc-5255-4578-8813-76701c0945bd][foo] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][baz] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][zoo] = 1
[503778bc-246f-4041-ac5a-fd944176b26d][aaa] = 1
Q: Which segments is user X in?
A: Column slice fetch
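Both queries can be sketched against plain dicts standing in for the wide rows (the UUIDs match the example data above; this is a conceptual model, not client code).

```python
users = {
    "f97be9cc-5255-4578-8813-76701c0945bd": {"bar": 1, "foo": 1},
    "06a6f1b0-fcf2-41d9-8949-fe2d416bde8e": {"baz": 1, "zoo": 1},
}

def in_segment(uuid, segment):
    # Q: Is user in segment X?  A: single column fetch.
    return segment in users.get(uuid, {})

def segments_of(uuid):
    # Q: Which segments is user X in?  A: column slice
    # (all columns of the row, back in comparator order).
    return sorted(users.get(uuid, {}))

print(in_segment("f97be9cc-5255-4578-8813-76701c0945bd", "foo"))  # True
print(segments_of("06a6f1b0-fcf2-41d9-8949-fe2d416bde8e"))        # ['baz', 'zoo']
```

Both are single-row operations, so they hit one replica set and stay fast however many users there are.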
Top Tip #3
With column slices, we get the columns back in order, according to our schema
We cannot do the same for rows, however, unless we use the Order Preserving Partitioner
Top Tip #4
Don’t use the Order Preserving Partitioner unless you absolutely have to
http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
Data modeling: user segments
Add user to bucket X, with expiry time Y
Which buckets is user X in?
["user"][<uuid>][<bucketId>] = 1
[CF] [rowKey] [columnName] = value
Expiring columns
An expiring column will be automatically deleted after n seconds
http://cassandra.apache.org/
Data modeling: user segments
$pool = new ConnectionPool('whyk', array('localhost'));
$users = new ColumnFamily($pool, 'users');
$users->insert(
    $userUuid,
    array($segment => 1),
    NULL,     // default TS
    $expires
);
Using phpcassa client: https://github.com/thobbs/phpcassa
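The expiring-column behaviour itself can be modelled in a few lines. A sketch for intuition only: each column carries an expiry timestamp and is filtered out on read once its TTL has passed (Cassandra does this server-side; the dict and the explicit `now` clock here are just to make the example deterministic).

```python
import time

row = {}

def insert(column, value, ttl, now=None):
    # Store the value alongside its expiry time.
    now = time.time() if now is None else now
    row[column] = (value, now + ttl)

def get(column, now=None):
    # Expired columns behave as if they were deleted.
    now = time.time() if now is None else now
    if column in row:
        value, expires_at = row[column]
        if now < expires_at:
            return value
    return None  # expired, or never written

insert("foo", 1, ttl=3600, now=0)
print(get("foo", now=1800))  # 1
print(get("foo", now=7200))  # None
```

This is what lets us add a user to a segment "for Y seconds" without any read-before-write or cleanup job.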
Data modeling: user segments
UPDATE users USING TTL = 3600
SET 'foo' = 1
WHERE KEY = 'f97be9cc-5255-4578-8813-76701c0945bd'
Using CQL http://www.datastax.com/dev/blog/what%E2%80%99s-new-in-cassandra-0-8-part-1-cql-the-cassandra-query-language
http://www.datastax.com/docs/1.0/references/cql
Top Tip #5
Try to exploit Cassandra’s columnar data model; avoid read-before-write and locking by safely mutating individual columns
Data modeling: ad performance
Track overall ad performance; how many clicks/impressions per ad?
["ads"][<adId>][<stamp>]["click"] = #
["ads"][<adId>][<stamp>]["impression"] = #
[CF] [Row] [S.Col] [Col] = value
Using super columns
Top Tip #6
Friends don’t let friends use Super Columns.
http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/
Data modeling: ad performance
Try again using regular columns:
["ads"][<adId>][<stamp>-"click"] = #
["ads"][<adId>][<stamp>-"impression"] = #
[CF] [Row] [Col] = value
Data modeling: ad performance
ads Column Family:
[1][2011103015-click] = 1
[1][2011103015-impression] = 3434
[1][2011103016-click] = 12
[1][2011103016-impression] = 5411
[1][2011103017-click] = 2
[1][2011103017-impression] = 345
Q: Get performance of ad X between two date/times
A: Column slice against a single row, specifying a start stamp and end stamp + 1
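The slice query above can be sketched against a dict standing in for the ads row (the stamps match the example data; real clients express this as a column slice with start and finish column names).

```python
ads = {
    "1": {
        "2011103015-click": 1, "2011103015-impression": 3434,
        "2011103016-click": 12, "2011103016-impression": 5411,
        "2011103017-click": 2, "2011103017-impression": 345,
    }
}

def performance(ad_id, start, end):
    # Column slice: all columns whose names fall in [start, end + 1),
    # comparing column names lexically, back in comparator order.
    row = ads[ad_id]
    finish = str(int(end) + 1)
    return {c: v for c, v in sorted(row.items()) if start <= c < finish}

# Hours 15 and 16 only; hour 17 falls outside the slice.
print(performance("1", "2011103015", "2011103016"))
```

The "end stamp + 1" trick works because every column for hour 16 sorts before the bare stamp "2011103017".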
Think carefully about your data
This scheme works because I’m assuming each ad has a relatively short lifespan. This means that there are lots of rows and hence the load is spread.
Other options:http://rubyscale.com/2011/basic-time-series-with-cassandra/
Counters
• Distributed atomic counters
• Easy to use
• Not idempotent
http://www.datastax.com/dev/blog/whats-new-in-cassandra-0-8-part-2-counters
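"Not idempotent" deserves a concrete illustration. A sketch of the failure mode: if an increment times out, the client cannot know whether it was applied, and retrying it may double-count.

```python
counter = {"clicks": 0}

def add(column, amount):
    # Increment is not idempotent: applying it twice changes the result.
    counter[column] += amount

add("clicks", 1)          # suppose this request timed out, outcome unknown...
add("clicks", 1)          # ...so the client retried it
print(counter["clicks"])  # 2, even though only one click happened
```

Contrast this with the segment writes earlier: re-sending `[uuid][segment] = 1` any number of times leaves the data unchanged.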
Data modeling: ad performance
$stamp = date('YmdH');
$ads->add(
    $adId,                // row key
    "$stamp-impression",  // column
    1                     // increment
);
We’ll store performance metrics in hour buckets for graphing.
Data modeling: ad performance
UPDATE ads
SET '2011103015-impression' = '2011103015-impression' + 1
WHERE KEY = '1'
Data modeling: performance/segment
We can add another dimension to our stats so we can break down by segment.
["ads"][<adId>] [<stamp>-<segment>-"click"] = #
[CF] [Row] [Col] = value
Data modeling: performance/segment
ads Column Family:
[1][2011103015-bar-click] = 1
[1][2011103015-bar-impression] = 3434
[1][2011103015-foo-click] = 12
[1][2011103015-foo-impression] = 5411
[1][2011103016-bar-click] = 2
Q: Get performance of ad X between two date/times, split by segment
A: Column slice against a single row, specifying a start stamp and end stamp + 1
Data modeling: performance/segment
$stamp = date('YmdH');
$ads->add(
    "$adId-segments",              // row key
    "$stamp-$segment-impression",  // column
    1                              // incr
);
We’ll store performance metrics in hour buckets for graphing.
Data modeling: segment stats
Track overall clicks/impressions per bucket; which buckets are most clicky?
["segments"][<adId>-"segments"] [<stamp>-<segment>-"click"] = #
[CF] [Row] [Col] = value
Recap: Data modeling
• Think about the queries, work backwards
• Don’t overuse single rows; try to spread the load
• Don’t use super columns
• Ask on IRC! #cassandra
Recap: Common data modeling patterns
1. Using column names with no value
[cf][rowKey][columnName] = 1
Recap: Common data modeling patterns
2. Counters
[cf][rowKey][columnName]++
And also…
3. Serialising a whole object
[cf][rowKey][columnName] = { foo: 3, bar: 11 }
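Pattern 3 as a sketch, using JSON as an assumed encoding (any serialisation the client can round-trip will do):

```python
import json

# Serialise a whole object into a single column value...
column_value = json.dumps({"foo": 3, "bar": 11})
print(column_value)

# ...and deserialise it back on read.
obj = json.loads(column_value)
print(obj["bar"])  # 11
```

The trade-off versus patterns 1 and 2: the whole object must be rewritten on every change, so this pattern loses the lock-free single-column mutations from Top Tip #5.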
There’s more: Brisk
Integrated Hadoop distribution (without HDFS installed). Run Hive and Pig queries directly against Cassandra
DataStax now offer this functionality in their “Enterprise” product
http://www.datastax.com/products/enterprise
Hive
CREATE EXTERNAL TABLE tempUsers (userUuid string, segmentId string, value string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
    "cassandra.columns.mapping" = ":key,:column,:value",
    "cassandra.cf.name" = "users"
);
SELECT segmentId, count(1) AS total
FROM tempUsers
GROUP BY segmentId
ORDER BY total DESC;
There’s more: Supercharged Cassandra
Acunu have reengineered the entire Unix storage stack, optimised specifically for Big Data workloads
Includes instant snapshot of CFs
http://www.acunu.com/products/choosing-cassandra/
In conclusion
Cassandra is founded on sound design principles
In conclusion
The Cassandra data model, sometimes mentioned as a weakness, is incredibly powerful
In conclusion
The clients are getting better; CQL is a step forward
In conclusion
Hadoop integration means we can analyse data directly from a Cassandra cluster
In conclusion
Cassandra’s sweet spot is highly available “big data” (especially time-series) with large numbers of writes
Thanks
Learn more about Cassandrameetup.com/Cassandra-London
Checkout the code https://github.com/davegardnerisme/we-have-your-kidneys
Watch videos from Cassandra SF 2011http://www.datastax.com/events/cassandrasf2011/presentations