Cassandra: Open Source Bigtable + Dynamo

Cassandra Jonathan Ellis

description

Cassandra is a highly scalable, eventually consistent, distributed, structured columnfamily store with no single points of failure, initially open-sourced by Facebook and now part of the Apache Incubator. These slides are from Jonathan Ellis's OSCON 09 talk: http://en.oreilly.com/oscon2009/public/schedule/detail/7975

Transcript of Cassandra: Open Source Bigtable + Dynamo

Page 1: Cassandra: Open Source Bigtable + Dynamo

Cassandra

Jonathan Ellis

Page 2: Cassandra: Open Source Bigtable + Dynamo

Motivation

● Scaling reads to a relational database is hard

● Scaling writes to a relational database is virtually impossible
● … and when you do, it usually isn't relational anymore

Page 3: Cassandra: Open Source Bigtable + Dynamo

The new face of data

● Scale out, not up
● Online load balancing, cluster growth
● Flexible schema
● Key-oriented queries
● CAP-aware

Page 4: Cassandra: Open Source Bigtable + Dynamo

CAP theorem

● Pick two of Consistency, Availability, Partition tolerance

Page 5: Cassandra: Open Source Bigtable + Dynamo

Two famous papers

● Bigtable: A distributed storage system for structured data, 2006

● Dynamo: Amazon's highly available key-value store, 2007

Page 6: Cassandra: Open Source Bigtable + Dynamo

Two approaches

● Bigtable: “How can we build a distributed db on top of GFS?”

● Dynamo: “How can we build a distributed hash table appropriate for the data center?”

Page 7: Cassandra: Open Source Bigtable + Dynamo

10,000 ft summary

● Dynamo partitioning and replication
● Log-structured ColumnFamily data model similar to Bigtable's

Page 8: Cassandra: Open Source Bigtable + Dynamo

Cassandra highlights

● High availability
● Incremental scalability
● Eventually consistent
● Tunable tradeoffs between consistency and latency
● Minimal administration
● No SPOF (single point of failure)

Page 13: Cassandra: Open Source Bigtable + Dynamo

Dynamo architecture & Lookup

Page 14: Cassandra: Open Source Bigtable + Dynamo

Architecture details

● O(1) node lookup
● Explicit replication
● Eventually consistent
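A minimal sketch of Dynamo-style partitioning, to make "O(1) node lookup" and "explicit replication" concrete. It assumes every node knows the whole ring, so any node can compute a key's replicas locally with no routing hops. The MD5 hash, the node names, and the `Ring` class are illustrative assumptions, not Cassandra's actual code.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key or node name onto the ring (MD5 is an assumption for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical consistent-hashing ring; each node owns one token."""

    def __init__(self, nodes, n=3):
        pairs = sorted((token(name), name) for name in nodes)
        self.tokens = [t for t, _ in pairs]
        self.names = [name for _, name in pairs]
        self.n = min(n, len(self.names))  # replication factor N

    def replicas_for(self, key):
        """Walk clockwise from the key's token and take the next N nodes."""
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.tokens)
        return [self.names[(start + i) % len(self.names)] for i in range(self.n)]


nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = Ring(nodes, n=3)
print(ring.replicas_for("user:42"))
```

The local search is logarithmic in the number of nodes; the point of "O(1)" here is that no other node has to be asked where a key lives.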

Page 15: Cassandra: Open Source Bigtable + Dynamo

Architecture layers

Messaging service

Gossip

Failure detection

Cluster state

Partitioner

Replication

Commit log

Memtable

SSTable

Indexes

Compaction

Tombstones

Hinted handoff

Read repair

Bootstrap

Monitoring

Admin tools

Page 16: Cassandra: Open Source Bigtable + Dynamo

Writes

● Any node
● Partitioner
● Commitlog, memtable
● SSTable
● Compaction
● Wait for W responses
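The node-local part of the write path can be sketched as follows. This is a toy model under stated assumptions: the `Node` class, the `memtable_limit` threshold, and discarding the log on flush are all hypothetical simplifications, and lists and dicts stand in for on-disk files.

```python
class Node:
    """Toy replica: commit log -> memtable -> SSTable (all in memory here)."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only; stands in for the on-disk log
        self.memtable = {}     # key -> {column: (value, timestamp)}
        self.sstables = []     # immutable snapshots sorted by key, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, column, value, timestamp):
        # 1. Sequential append for durability: no read, no seek.
        self.commit_log.append((key, column, value, timestamp))
        # 2. Update the in-memory table.
        self.memtable.setdefault(key, {})[column] = (value, timestamp)
        # 3. Flush once the memtable holds enough rows (threshold is an assumption).
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out sorted by key, then start fresh.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # assumption: flushed entries no longer need the log


node = Node(memtable_limit=2)
node.write("keyC", "column1", "x", 1)
node.write("keyA", "column1", "y", 2)
print(list(node.sstables[0]))  # keys come out sorted
```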

Page 17: Cassandra: Open Source Bigtable + Dynamo

Memtable / SSTable

Commit log

Disk

Page 18: Cassandra: Open Source Bigtable + Dynamo

SSTable format

● Key / data

Page 19: Cassandra: Open Source Bigtable + Dynamo

SSTable Indexes

● Bloom filter
● Key
● Column

(Similar to Hadoop MapFile / TFile)
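To illustrate why a Bloom filter helps here: it lets a read skip an SSTable that definitely does not contain a key, at the cost of occasional false positives. The sketch below is a generic Bloom filter; the bit count, the number of hashes, and the use of SHA-256 are assumptions, not the parameters Cassandra uses.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: no false negatives, some false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


bf = BloomFilter()
for key in ["keyA", "keyC"]:
    bf.add(key)
print(bf.might_contain("keyA"))
```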

Page 20: Cassandra: Open Source Bigtable + Dynamo

Compaction

● Merge keys
● Combine columns
● Discard tombstones

Page 21: Cassandra: Open Source Bigtable + Dynamo

Remove

● Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction
● Read repair complicates things a little
● Eventual consistency complicates things more
● Solution: configurable delay before tombstone GC, after which tombstones are not repaired
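Compaction and tombstone GC from the two slides above can be sketched together. Assumptions: an SSTable is a dict of `key -> {column: (value, timestamp)}`, a tombstone is a column whose value is `None`, the newest timestamp wins when columns are combined, and `gc_grace` is a hypothetical name for the configurable delay, in the same units as the timestamps.

```python
TOMBSTONE = None  # assumption: a deleted column is stored as value None plus a timestamp


def compact(sstables, now, gc_grace):
    """Merge SSTables into one, keeping the newest version of each column."""
    merged = {}
    for table in sstables:
        for key, columns in table.items():
            row = merged.setdefault(key, {})          # merge keys
            for name, (value, ts) in columns.items():
                if name not in row or ts > row[name][1]:
                    row[name] = (value, ts)           # combine columns, newest wins

    result = {}
    for key in sorted(merged):
        kept = {
            name: (value, ts)
            for name, (value, ts) in merged[key].items()
            # Discard a tombstone only after the grace period has passed.
            if value is not TOMBSTONE or now - ts < gc_grace
        }
        if kept:
            result[key] = kept
    return result


old = {"keyA": {"column1": ("v1", 1), "column2": ("v2", 1)}}
new = {"keyA": {"column1": (TOMBSTONE, 5)}, "keyC": {"column7": ("v7", 6)}}
print(compact([old, new], now=100, gc_grace=10))
```

Within the grace period the tombstone survives compaction, so it can still suppress the old value on replicas that have not yet seen the delete.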

Page 22: Cassandra: Open Source Bigtable + Dynamo

Cassandra write properties

● No reads
● No seeks
● Fast
● Atomic within ColumnFamily
● Always writable

Page 23: Cassandra: Open Source Bigtable + Dynamo

Read path

● Any node
● Partitioner
● Wait for R responses
● Wait for N – R responses in the background and perform read repair
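A rough sketch of the read path with read repair. Each replica is modelled as a plain dict of `key -> (value, timestamp)`, and the "background" step runs synchronously for simplicity; the `read` and `newest` functions are hypothetical names, and the highest timestamp is assumed to win.

```python
def newest(responses):
    """Pick the (value, timestamp) pair with the highest timestamp, or None."""
    found = [resp for resp in responses if resp is not None]
    return max(found, key=lambda pair: pair[1], default=None)


def read(replicas, key, r):
    """Answer from the first R replicas, then repair all N replicas."""
    responses = [replica.get(key) for replica in replicas]

    # The caller only waits for R responses.
    answer = newest(responses[:r])

    # Stand-in for the background step: compare all N responses and
    # push the newest version to any replica that is missing or stale.
    latest = newest(responses)
    if latest is not None:
        for replica, seen in zip(replicas, responses):
            if seen is None or seen[1] < latest[1]:
                replica[key] = latest
    return answer


replicas = [{"k": ("old", 1)}, {"k": ("new", 2)}, {}]
first = read(replicas, "k", r=1)    # may be stale when R=1
second = read(replicas, "k", r=1)   # repaired by the previous read
print(first, second)
```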

Page 24: Cassandra: Open Source Bigtable + Dynamo

Cassandra read properties

● Read multiple SSTables
● Slower than writes (but still fast)
● Seeks can be mitigated with more RAM
● Scales to billions of rows

Page 25: Cassandra: Open Source Bigtable + Dynamo

Consistency in a BASE world

● If W + R > N, you will have consistency
● W=1, R=N
● W=N, R=1
● W=Q, R=Q where Q = N / 2 + 1 (integer division)
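Why W + R > N is enough: any set of W replicas that acknowledged a write must share at least one node with any set of R replicas answering a read. The brute-force check below illustrates this for small N; the function names are made up for this example.

```python
from itertools import combinations


def quorum(n):
    """Q = N / 2 + 1 using integer division."""
    return n // 2 + 1


def always_overlap(n, w, r):
    """True if every W-node write set intersects every R-node read set."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


n = 3
q = quorum(n)
print(q, always_overlap(n, q, q), always_overlap(n, 1, 1))
```

With N=3, quorum reads and writes (W=2, R=2) always overlap, while W=1, R=1 can miss the latest write.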

Page 26: Cassandra: Open Source Bigtable + Dynamo

vs MySQL with 50GB of data

● MySQL
  ● ~300ms write
  ● ~350ms read
● Cassandra
  ● ~0.12ms write
  ● ~15ms read

● Achtung!

Page 27: Cassandra: Open Source Bigtable + Dynamo

Data model

● Rows, ColumnFamilies, Columns

Page 28: Cassandra: Open Source Bigtable + Dynamo

ColumnFamilies

keyA column1 column2 column3

keyC column1 column7 column11

Column

Byte[] Name

Byte[] Value

I64 timestamp

Page 29: Cassandra: Open Source Bigtable + Dynamo

Super ColumnFamilies

keyF Super1 Super2

keyJ Super1 Super5

column column column column column column

column column column column column column

Page 30: Cassandra: Open Source Bigtable + Dynamo

Types of queries

● Single column
● Slice
  ● Set of names / range of names
  ● Simple slice -> columns
  ● Super slice -> supercolumns
● Key range
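A small model of a row in a simple ColumnFamily, with the two kinds of slice. The `Column` fields follow the slide (name, value, timestamp); the helper names, the inclusive range bounds, and plain byte ordering of column names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes
    timestamp: int


def slice_by_names(row, names):
    """Slice by an explicit set of column names; missing names are skipped."""
    return [row[name] for name in names if name in row]


def slice_by_range(row, start, finish, count=100):
    """Slice by a range of column names (inclusive bounds are an assumption)."""
    names = sorted(name for name in row if start <= name <= finish)
    return [row[name] for name in names[:count]]


# One row ("keyA") of a simple ColumnFamily: column name -> Column.
row = {
    c.name: c
    for c in [
        Column(b"column1", b"a", 1),
        Column(b"column2", b"b", 1),
        Column(b"column3", b"c", 1),
    ]
}
print([c.name for c in slice_by_range(row, b"column1", b"column2")])
```

A super slice would work the same way one level up, returning supercolumns that each hold a list of columns.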

Page 31: Cassandra: Open Source Bigtable + Dynamo

Range queries

● Add “master” server
● Implement on top of K/V
● Order-preserving partitioning
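With order-preserving partitioning, keys stay in sorted order across the ring, so a key range is a contiguous scan rather than a scatter over hashed positions. The sketch below shows the idea on a single sorted key list. The parameter names echo the `get_key_range` call in the Thrift slide, but this function is a stand-in, and treating `stop_at` as inclusive is an assumption.

```python
import bisect


def get_key_range(sorted_keys, start_with="", stop_at="", max_results=100):
    """Return up to max_results keys between start_with and stop_at."""
    lo = bisect.bisect_left(sorted_keys, start_with)
    # An empty stop_at means "no upper bound".
    hi = bisect.bisect_right(sorted_keys, stop_at) if stop_at else len(sorted_keys)
    return sorted_keys[lo:hi][:max_results]


keys = ["apple", "banana", "cherry", "date"]
print(get_key_range(keys, start_with="b", stop_at="d"))
```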

Page 32: Cassandra: Open Source Bigtable + Dynamo

Modification

● Insert / update
● Remove
● Single column or batch
● Specify W, number of nodes to wait for

Page 33: Cassandra: Open Source Bigtable + Dynamo

Thrift

struct Column {
   1: binary name,
   2: binary value,
   3: i64 timestamp,
}

struct SuperColumn {
   1: binary name,
   2: list<Column> columns,
}

Column get_column(table, key, column_path, block_for=1)

list<string> get_key_range(table, column_family, start_with="", stop_at="", max_results=100)

void insert(table, key, column_path, value, timestamp, block_for=0)

void remove(tablename, key, column_path_or_parent, timestamp)

Page 34: Cassandra: Open Source Bigtable + Dynamo

Honestly, Thrift kinda sucks

Page 35: Cassandra: Open Source Bigtable + Dynamo

Example: a multiuser blog

Two queries

- the most recent posts belonging to a given blog, in reverse chronological order

- a single post and its comments, in chronological order

Page 36: Cassandra: Open Source Bigtable + Dynamo

First try

JBE blog: Cassandra is teh awesome (post, comment, comment), BASE FTW (post, comment, comment)

Evan blog: I like kittens (post, comment, comment), And Ruby (post, comment, comment)

<ColumnFamily Type="Super" CompareWith="TimeString" CompareSubcolumnsWith="UUID" Name="Blog"/>

Page 37: Cassandra: Open Source Bigtable + Dynamo

Second try

<ColumnFamily CompareWith="UUIDType" Name="Blog"/>

JBE blog: Cassandra is teh awesome, BASE FTW

Evan blog: I like kittens, And Ruby

Cassandra is teh awesome: comment, comment

BASE FTW: comment, comment

I like kittens: comment, comment

And Ruby: comment, comment

<ColumnFamily CompareWith="UUIDType" Name="Comment"/>
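The second design can be sketched with two dicts standing in for the Blog and Comment ColumnFamilies, and the two queries from the example answered as slices. Assumptions: an increasing counter stands in for time-based UUID column names, the Comment row is keyed by the post's id, and all function names are hypothetical.

```python
import itertools

_clock = itertools.count(1)  # stand-in for time-ordered UUID column names

blog = {}     # "Blog" ColumnFamily: blog name -> {time_id: post title}
comment = {}  # "Comment" ColumnFamily: post time_id -> {time_id: comment text}


def add_post(blog_name, title):
    post_id = next(_clock)
    blog.setdefault(blog_name, {})[post_id] = title
    return post_id


def add_comment(post_id, text):
    comment.setdefault(post_id, {})[next(_clock)] = text


def recent_posts(blog_name, count=10):
    """Query 1: most recent posts of a blog, reverse chronological."""
    row = blog.get(blog_name, {})
    return [row[i] for i in sorted(row, reverse=True)[:count]]


def post_with_comments(blog_name, post_id):
    """Query 2: a single post and its comments, chronological."""
    row = comment.get(post_id, {})
    return blog[blog_name][post_id], [row[i] for i in sorted(row)]


p1 = add_post("JBE blog", "Cassandra is teh awesome")
p2 = add_post("JBE blog", "BASE FTW")
add_comment(p1, "first")
add_comment(p1, "second")
print(recent_posts("JBE blog"))
print(post_with_comments("JBE blog", p1))
```

Each query touches a single row in a single ColumnFamily, which is what the split into Blog and Comment buys over the one-supercolumn design.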

Page 38: Cassandra: Open Source Bigtable + Dynamo

Roadmap

Page 39: Cassandra: Open Source Bigtable + Dynamo

Cassandra 0.3

● Remove support
● OPP / Range queries
● Test suite
● Workarounds for JDK bugs
● Rudimentary multi-datacenter support

Page 40: Cassandra: Open Source Bigtable + Dynamo

Cassandra 0.4

● Branched May 18
● Data file format change to support billions of rows per node instead of millions
● API changes (no more colon delimiters)
● Multi-table (keyspace) support
● LRU key cache
● fsync support
● Bootstrap
● Web interface

Page 41: Cassandra: Open Source Bigtable + Dynamo

Cassandra 0.5

● Bootstrap
● Load balancing
  ● Closely related to “bootstrap done right”
● Merkle tree repair
● Millions of columns per row
  ● This will require another data format change
● Multiget
● Callout support

Page 42: Cassandra: Open Source Bigtable + Dynamo

Users

Production: Facebook, RocketFuel

Production RSN: Digg, Rackspace

No date yet: IBM Research, Twitter

Evaluating: 50+ in #cassandra on freenode

Page 43: Cassandra: Open Source Bigtable + Dynamo

More

● Eventual consistency: http://www.allthingsdistributed.com/2008/12/eventually_consistent.html

● Introduction to distributed databases by Todd Lipcon at NoSQL 09: http://www.vimeo.com/5145059

● Other articles/videos about Cassandra: http://wiki.apache.org/cassandra/ArticlesAndPresentations

● #cassandra on irc.freenode.net

Page 44: Cassandra: Open Source Bigtable + Dynamo

Cassandra