Cassandra: Open Source Bigtable + Dynamo

Post on 11-May-2015

Cassandra is a highly scalable, eventually consistent, distributed, structured columnfamily store with no single points of failure, initially open-sourced by Facebook and now part of the Apache Incubator. These slides are from Jonathan Ellis's OSCON 09 talk: http://en.oreilly.com/oscon2009/public/schedule/detail/7975

Transcript of Cassandra: Open Source Bigtable + Dynamo

Cassandra

Jonathan Ellis

Motivation

● Scaling reads to a relational database is hard

● Scaling writes to a relational database is virtually impossible

● … and when you do, it usually isn't relational anymore

The new face of data

● Scale out, not up

● Online load balancing, cluster growth

● Flexible schema

● Key-oriented queries

● CAP-aware

CAP theorem

● Pick two of Consistency, Availability, Partition tolerance

Two famous papers

● Bigtable: A distributed storage system for structured data, 2006

● Dynamo: Amazon's highly available key-value store, 2007

Two approaches

● Bigtable: “How can we build a distributed db on top of GFS?”

● Dynamo: “How can we build a distributed hash table appropriate for the data center?”

10,000 ft summary

● Dynamo partitioning and replication

● Log-structured ColumnFamily data model similar to Bigtable's

Cassandra highlights

● High availability

● Incremental scalability

● Eventually consistent

● Tunable tradeoffs between consistency and latency

● Minimal administration

● No single point of failure (SPOF)

Dynamo architecture & Lookup

Architecture details

● O(1) node lookup

● Explicit replication

● Eventually consistent
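The lookup and replication points above can be illustrated with a small hash ring. This is a minimal sketch, not Cassandra's code: the `Ring` class, the SHA-256 token function, and the node names are all assumptions made for illustration. The idea it shows is that every node holds the full ring, so finding the replicas for a key is a local computation with no extra network hops.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (hypothetical choice: SHA-256)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical ring in which every node knows the full membership list."""

    def __init__(self, nodes, replication_factor=3):
        self.n = replication_factor
        # Sort nodes by token so the ring can be walked clockwise.
        self.entries = sorted((token(name), name) for name in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, key: str):
        """Return the N nodes responsible for key, computed locally."""
        start = bisect.bisect_right(self.tokens, token(key)) % len(self.entries)
        count = min(self.n, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
print(ring.replicas("user:42"))
```

The first node clockwise from the key's token is the primary replica, and the next N - 1 nodes on the ring hold the explicit copies.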

Architecture layers

Messaging service

Gossip

Failure detection

Cluster state

Partitioner

Replication

Commit log

Memtable

SSTable

Indexes

Compaction

Tombstones

Hinted handoff

Read repair

Bootstrap

Monitoring

Admin tools

Writes

● Any node

● Partitioner

● Commitlog, memtable

● SSTable

● Compaction

● Wait for W responses
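The local part of the write path above (commit log, then memtable, then flush to an SSTable) can be sketched as follows. This is a simplified illustration under stated assumptions: the `Node` class, its in-memory lists, and the `memtable_limit` threshold are hypothetical, and the coordinator step of waiting for W responses is not modelled.

```python
class Node:
    """Hypothetical single-node storage engine, kept in memory for clarity."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only; stands in for the on-disk log
        self.memtable = {}     # key -> {column: (value, timestamp)}
        self.sstables = []     # immutable snapshots, each sorted by key
        self.memtable_limit = memtable_limit

    def write(self, key, column, value, timestamp):
        # 1. Append to the commit log first, for durability.
        self.commit_log.append((key, column, value, timestamp))
        # 2. Apply to the memtable; nothing is read back from SSTables.
        row = self.memtable.setdefault(key, {})
        current = row.get(column)
        if current is None or timestamp >= current[1]:
            row[column] = (value, timestamp)
        # 3. Flush once the memtable holds enough rows.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # An SSTable is written once, sorted by key, and never modified.
        self.sstables.append(tuple(sorted(self.memtable.items(), key=lambda kv: kv[0])))
        self.memtable = {}
        self.commit_log.clear()  # flushed entries no longer need replay


node = Node(memtable_limit=2)
node.write("keyA", "column1", "v1", 1)
node.write("keyC", "column1", "v2", 2)   # second row triggers a flush
print(len(node.sstables), node.memtable)
```

Because flushed files are never updated in place, overlapping SSTables accumulate over time, which is what compaction later cleans up.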

Memtable / SSTable

Commit log

Disk

SSTable format

● Key / data

SSTable Indexes

● Bloom filter

● Key

● Column

(Similar to Hadoop MapFile / TFile)
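To show why a Bloom filter is useful as an SSTable index, here is a minimal sketch. The `BloomFilter` class, its sizes, and the SHA-256-based hashing are assumptions for illustration, not Cassandra's implementation. The filter can answer "definitely not in this SSTable" without touching disk, at the cost of occasional false positives.

```python
import hashlib


class BloomFilter:
    """Hypothetical Bloom filter: no false negatives, some false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, key: str):
        # Derive several bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means the key is certainly absent; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
bf.add("keyA")
print(bf.might_contain("keyA"))
```

A read would consult the filter for each SSTable and skip those for which `might_contain` returns False.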

Compaction

● Merge keys

● Combine columns

● Discard tombstones
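The three compaction steps above can be sketched as a merge over several SSTables. This is an illustration only: the dict-based SSTable layout, the `TOMBSTONE` marker, and the `compact` function are hypothetical, and the sketch drops tombstones immediately rather than after a configurable delay.

```python
TOMBSTONE = object()  # hypothetical deletion marker


def compact(sstables):
    """Merge SSTables into one; the newest timestamp wins for each column."""
    merged = {}
    for sstable in sstables:
        for key, columns in sstable.items():          # merge keys
            row = merged.setdefault(key, {})
            for name, (value, timestamp) in columns.items():   # combine columns
                if name not in row or timestamp > row[name][1]:
                    row[name] = (value, timestamp)
    result = {}
    for key in sorted(merged):
        # Discard tombstones, and drop rows left with no live columns.
        live = {n: vt for n, vt in merged[key].items() if vt[0] is not TOMBSTONE}
        if live:
            result[key] = live
    return result


older = {"keyA": {"column1": ("v1", 1), "column2": ("v2", 1)},
         "keyC": {"column7": ("x", 1)}}
newer = {"keyA": {"column1": ("v1b", 2)},
         "keyC": {"column7": (TOMBSTONE, 2)}}
print(compact([older, newer]))
```

In this example `keyC` disappears entirely, because its only column was deleted by a newer tombstone.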

Remove

● Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction

● Read repair complicates things a little

● Eventual consistency complicates things more

● Solution: configurable delay before tombstone GC, after which tombstones are not repaired

Cassandra write properties

● No reads

● No seeks

● Fast

● Atomic within ColumnFamily

● Always writable

Read path

● Any node

● Partitioner

● Wait for R responses

● Wait for N - R responses in the background and perform read repair
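The read path above can be sketched as follows. This is a simplified model, not Cassandra's code: replicas are plain dicts, the `read` and `newest` helpers are hypothetical, and the background step runs inline rather than asynchronously. It shows that the caller gets the newest of the first R responses, while stale replicas are brought up to date afterwards.

```python
def newest(versions):
    """Pick the (value, timestamp) pair with the highest timestamp, ignoring misses."""
    found = [v for v in versions if v is not None]
    return max(found, key=lambda v: v[1]) if found else None


def read(replicas, key, r):
    """replicas: list of dicts mapping key -> (value, timestamp)."""
    # Foreground: answer as soon as R replicas have responded.
    result = newest(replica.get(key) for replica in replicas[:r])
    # Background (inline here): collect the remaining N - R responses and
    # push the newest version to any replica that is missing it.
    latest = newest(replica.get(key) for replica in replicas)
    if latest is not None:
        for replica in replicas:
            if replica.get(key) != latest:
                replica[key] = latest
    return result


a = {"k": ("old", 1)}
b = {"k": ("new", 2)}
c = {}
print(read([a, b, c], "k", r=1))   # may be stale with R=1
print(read([a, b, c], "k", r=1))   # repaired by the previous read
```

With R=1 the first read can return a stale value, but the repair it triggers means the next read sees the newest one.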

Cassandra read properties

● Read multiple SSTables

● Slower than writes (but still fast)

● Seeks can be mitigated with more RAM

● Scales to billions of rows

Consistency in a BASE world

● If W + R > N, you will have consistency

● W=1, R=N

● W=N, R=1

● W=Q, R=Q where Q = N / 2 + 1 (integer division)
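The W + R > N rule holds because any set of W replicas and any set of R replicas must then share at least one node, so every read touches a replica that saw the latest write. The sketch below checks this by brute force for small clusters; the function names are hypothetical.

```python
from itertools import combinations


def quorum(n: int) -> int:
    """Smallest majority of n replicas: floor(n / 2) + 1."""
    return n // 2 + 1


def always_overlap(n: int, w: int, r: int) -> bool:
    """Does every set of W replicas share a node with every set of R replicas?"""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


# Exhaustively confirm the rule for clusters of up to five replicas.
for n in range(1, 6):
    for w in range(1, n + 1):
        for r in range(1, n + 1):
            assert always_overlap(n, w, r) == (w + r > n)

print(quorum(3), quorum(4), quorum(5))  # prints: 2 3 3
```

All three settings listed above satisfy the rule: for N = 3, W=1/R=3, W=3/R=1, and W=2/R=2 each sum to more than 3.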

vs MySQL with 50GB of data

● MySQL: ~300ms write, ~350ms read

● Cassandra: ~0.12ms write, ~15ms read

● Achtung!

Data model

● Rows, ColumnFamilies, Columns

ColumnFamilies

keyA column1 column2 column3

keyC column1 column7 column11

Column

Byte[] Name

Byte[] Value

I64 timestamp

Super ColumnFamilies

keyF Super1 Super2

keyJ Super1 Super5

column column column column column column

column column column column column column

Types of queries

● Single column

● Slice

● Set of names / range of names

● Simple slice -> columns

● Super slice -> supercolumns

● Key range
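The single-column and slice queries above can be sketched against one row whose columns are kept sorted by name. This is an illustration under stated assumptions: the `Row` class and its method names are hypothetical and are not the Thrift API, and names are compared as plain strings.

```python
import bisect


class Row:
    """Hypothetical row: columns kept sorted by name so ranges are cheap."""

    def __init__(self):
        self.names = []    # sorted column names
        self.values = {}   # name -> value

    def insert(self, name, value):
        if name not in self.values:
            bisect.insort(self.names, name)
        self.values[name] = value

    def get(self, name):
        """Single column lookup."""
        return self.values.get(name)

    def slice_by_names(self, names):
        """Slice by an explicit set of names; missing names are skipped."""
        return [(n, self.values[n]) for n in names if n in self.values]

    def slice_by_range(self, start, finish):
        """Slice by an inclusive range of names, in sorted order."""
        lo = bisect.bisect_left(self.names, start)
        hi = bisect.bisect_right(self.names, finish)
        return [(n, self.values[n]) for n in self.names[lo:hi]]


row = Row()
for name, value in [("column3", "c"), ("column1", "a"), ("column2", "b")]:
    row.insert(name, value)
print(row.slice_by_range("column2", "column3"))
```

A super slice would work the same way one level up, returning whole supercolumns instead of individual columns.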

Range queries

● Add “master” server

● Implement on top of K/V

● Order-preserving partitioning
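To illustrate the last option, here is a minimal sketch of order-preserving partitioning. The class, its split points, and the node numbering are hypothetical. Because keys are assigned to nodes by their natural order rather than by a hash, a key range maps to a contiguous run of nodes and the query does not have to visit every node.

```python
import bisect


class OrderPreservingPartitioner:
    """Hypothetical partitioner: node i owns the keys between split points i-1 and i."""

    def __init__(self, split_points):
        # Sorted split points; len(split_points) + 1 nodes in total.
        self.split_points = sorted(split_points)

    def node_for(self, key: str) -> int:
        # Keys below the first split point go to node 0, and so on upward.
        return bisect.bisect_right(self.split_points, key)

    def nodes_for_range(self, start: str, stop: str):
        """Nodes that can hold keys in the inclusive range [start, stop]."""
        return list(range(self.node_for(start), self.node_for(stop) + 1))


p = OrderPreservingPartitioner(["g", "n", "t"])
print(p.node_for("apple"), p.node_for("zebra"))
print(p.nodes_for_range("h", "m"))
```

The tradeoff is that skewed key distributions can overload individual nodes, which a hash-based partitioner avoids by scattering keys.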

Modification

● Insert / update

● Remove

● Single column or batch

● Specify W, number of nodes to wait for

Thrift

struct Column {
   1: binary name,
   2: binary value,
   3: i64 timestamp,
}

struct SuperColumn {
   1: binary name,
   2: list<Column> columns,
}

Column get_column(table, key, column_path, block_for=1)

list<string> get_key_range(table, column_family, start_with="", stop_at="", max_results=100)

void insert(table, key, column_path, value, timestamp, block_for=0)

void remove(tablename, key, column_path_or_parent, timestamp)

Honestly, Thrift kinda sucks

Example: a multiuser blog

Two queries

- the most recent posts belonging to a given blog, in reverse chronological order

- a single post and its comments, in chronological order

First try

JBE blog: Cassandra is teh awesome (post, comment, comment) | BASE FTW (post, comment, comment)

Evan blog: I like kittens (post, comment, comment) | And Ruby (post, comment, comment)

<ColumnFamily Type="Super" CompareWith="TimeString" CompareSubcolumnsWith="UUID" Name="Blog"/>

Second try

<ColumnFamily CompareWith="UUIDType" Name="Blog"/>

JBE blog: Cassandra is teh awesome | BASE FTW

Evan blog: I like kittens | And Ruby

Cassandra is teh awesome: comment | comment

BASE FTW: comment | comment

I like kittens: comment | comment

And Ruby: comment | comment

<ColumnFamily CompareWith="UUIDType" Name="Comment"/>
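The second layout can be sketched with two plain dictionaries standing in for the Blog and Comment ColumnFamilies. Everything here is an assumption for illustration: the helper names are hypothetical, an increasing counter stands in for time-based UUIDs, and comments are keyed by post id rather than by title. The point is that each of the two queries becomes a lookup of a single row followed by a read of its columns in sorted order.

```python
import itertools

_ids = itertools.count(1)  # stand-in for time-based UUIDs: larger id = later

blog = {}     # Blog CF:    blog name -> {post_id: post body}
comment = {}  # Comment CF: post_id   -> {comment_id: comment body}


def add_post(blog_name, body):
    post_id = next(_ids)
    blog.setdefault(blog_name, {})[post_id] = body
    return post_id


def add_comment(post_id, body):
    comment_id = next(_ids)
    comment.setdefault(post_id, {})[comment_id] = body
    return comment_id


def recent_posts(blog_name, limit=10):
    """Query 1: most recent posts of a blog, in reverse chronological order."""
    posts = blog.get(blog_name, {})
    return [posts[i] for i in sorted(posts, reverse=True)[:limit]]


def post_with_comments(blog_name, post_id):
    """Query 2: a single post and its comments, in chronological order."""
    comments = comment.get(post_id, {})
    return blog[blog_name][post_id], [comments[i] for i in sorted(comments)]


first = add_post("JBE blog", "Cassandra is teh awesome")
second = add_post("JBE blog", "BASE FTW")
add_comment(first, "comment 1")
add_comment(first, "comment 2")
print(recent_posts("JBE blog"))
print(post_with_comments("JBE blog", first))
```

Splitting comments into their own ColumnFamily means fetching recent posts never drags along every comment, which is the cost the first layout would pay.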

Roadmap

Cassandra 0.3

● Remove support

● OPP / Range queries

● Test suite

● Workarounds for JDK bugs

● Rudimentary multi-datacenter support

Cassandra 0.4

● Branched May 18

● Data file format change to support billions of rows per node instead of millions

● API changes (no more colon delimiters)

● Multi-table (keyspace) support

● LRU key cache

● fsync support

● Bootstrap

● Web interface

Cassandra 0.5

● Bootstrap

● Load balancing

● Closely related to “bootstrap done right”

● Merkle tree repair

● Millions of columns per row

● This will require another data format change

● Multiget

● Callout support

Users

Production: Facebook, RocketFuel

Production RSN: Digg, Rackspace

No date yet: IBM Research, Twitter

Evaluating: 50+ in #cassandra on freenode

More

● Eventual consistency: http://www.allthingsdistributed.com/2008/12/eventually_consistent.html

● Introduction to distributed databases by Todd Lipcon at NoSQL 09: http://www.vimeo.com/5145059

● Other articles/videos about Cassandra: http://wiki.apache.org/cassandra/ArticlesAndPresentations

● #cassandra on irc.freenode.net
